Stack Overflow developer survey
EDA and salary modelling
The aim of this post is to explore the Stack Overflow Developer Survey 2018 dataset and do some modelling of salary. As a bonus, we'll discuss some properties of the mean absolute error.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option('display.max_columns', None)
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklego.preprocessing import PandasTypeSelector
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_absolute_error, mean_squared_error
from lightgbm import LGBMRegressor
from category_encoders import TargetEncoder
so_df = pd.read_csv('stackoverflow.csv', low_memory=False)
Let's have a quick look at the data that we have available
so_df.head()
We can see that most columns are non-numeric. Among the few numeric columns are the AssessJob* ones. Having a look at the data schema, we can see that users have to rank several features of the job they're considering from 1 to 10, where 1 is the most important and 10 the least important. In the case of AssessJob5, the feature being rated is "The compensation and benefits offered". A natural question is whether people who value compensation more actually make more. The answer is in the following query:
(so_df
.groupby('AssessJob5')
.agg({'ConvertedSalary': ['mean', 'median', len]})
)
Indeed, people who rate the compensation as more important make on average more, both in terms of mean and median.
Before asking and attempting to answer more questions regarding the salary, let's make some utility functions that will be useful.
def p10(x: pd.Series) -> float:
    """Returns the 10th percentile of a pandas Series"""
    return x.quantile(.10)

def p90(x: pd.Series) -> float:
    """Returns the 90th percentile of a pandas Series"""
    return x.quantile(.90)
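As a quick sanity check on made-up numbers (not the survey data), pandas' quantile uses linear interpolation by default, so these helpers return interpolated values rather than exact sample points:

```python
import pandas as pd

def p10(x: pd.Series) -> float:
    """Returns the 10th percentile of a pandas Series"""
    return x.quantile(.10)

# 1..100: the 10th percentile falls between 10 and 11 and is interpolated
s = pd.Series(range(1, 101))
print(p10(s))  # 10.9
```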
def salary_summary(so_df: pd.DataFrame, var: str) -> pd.DataFrame:
    """Summarizes several salary statistics, grouping by a variable"""
    return (so_df
            .groupby(var)
            .agg({'ConvertedSalary': ['min', p10, 'median', p90, 'mean', 'max', len]})
            )
salary_summary describes some statistics with respect to a grouping variable. Let's use YearsCoding as that variable. We can see that more years of coding generally imply higher salaries.
salary_summary(so_df, 'YearsCoding')
We can actually attempt to plot the evolution of the median salary in terms of years spent programming:
so_df['years_coding_int'] = so_df.YearsCoding.map({
    '30 or more years': 35,
    '6-8 years': 7,
    '9-11 years': 10,
    '0-2 years': 1,
    '15-17 years': 16,
    '18-20 years': 19,
    '3-5 years': 4,
    '12-14 years': 13,
    '24-26 years': 25,
    '21-23 years': 22,
    '27-29 years': 28
})
year_summary = salary_summary(so_df, 'years_coding_int')
sns.lineplot(x=year_summary.index, y=year_summary.ConvertedSalary['median'])
A somewhat expected result: the median salary increases quickly at first and slows down after some years. I expected the saturation to happen earlier, but apparently I was wrong.
Getting the median salary was also a bit messy, so we'll create a helper function to produce a tidier structure.
def flatten_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Tidy the MultiIndex columns of the output of salary_summary"""
    df.columns = ["_".join(col) for col in df.columns]
    return df
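As a minimal sketch on made-up data (not the survey), this is what flatten_columns does to the MultiIndex columns produced by a groupby/agg:

```python
import pandas as pd

def flatten_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Join each (column, statistic) pair into a single underscore-separated name"""
    df.columns = ["_".join(col) for col in df.columns]
    return df

# Toy grouped frame with MultiIndex columns, mimicking salary_summary's output
toy = (pd.DataFrame({'Country': ['A', 'A', 'B'], 'ConvertedSalary': [10, 20, 30]})
       .groupby('Country')
       .agg({'ConvertedSalary': ['median', 'mean']}))

flat = flatten_columns(toy)
print(list(flat.columns))  # ['ConvertedSalary_median', 'ConvertedSalary_mean']
```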
Let's see in which countries you can make more money in terms of the median. We'll use the flatten_columns function to make sorting easier:
flatten_columns(salary_summary(so_df, 'Country')).\
sort_values('ConvertedSalary_median', ascending=False)
Let's stop for a moment and analyse the issues with what we are doing:
- The code starts to look messy; it's hard to read and reason about. We'll use cleaner code from now on.
- I'd never say that Togo is a top-earning country. The issue is that the top countries are the ones with low statistical mass (the last column is the number of surveys answered in that country). We'll filter for countries with more than x surveys to answer this question properly.
To write cleaner code, we can use pandas pipes. This was a big lesson for me: now I'm able to write dplyr-like code in pandas.
To solve the issue of low statistical mass, we only consider countries with more than 40 finished surveys. The results make more sense from a macroeconomic perspective, with the US and Switzerland being the top-earning countries:
(so_df
.pipe(salary_summary, 'Country')
.pipe(flatten_columns)
.query('ConvertedSalary_len > 40')
.sort_values('ConvertedSalary_median', ascending=False)
.head(10)
)
We can also measure salary inequality in a given country. A common approach is to divide the 90th percentile by the 10th percentile: the bigger this ratio, the higher the inequality. Venezuela, Viet Nam and Nigeria seem to be the most unequal countries:
(so_df
.pipe(salary_summary, 'Country')
.pipe(flatten_columns)
.assign(inequality=lambda x: x['ConvertedSalary_p90'] / x['ConvertedSalary_p10'])
.query('ConvertedSalary_len > 40')
.sort_values('inequality', ascending=False)
.head(10)
)
Some columns have information regarding past experience. For instance, we have DatabaseWorkedWith. Whether this is null or not indicates if the user has experience working with databases. We're going to look at several experience questions to create features regarding lack of experience.
def na_performance(so_df: pd.DataFrame, var: str) -> pd.DataFrame:
    """Summarizes the salary statistics, grouping by whether a variable is empty"""
    return (so_df
            .assign(some_experience=lambda x: ~x[var].isnull())
            .pipe(salary_summary, 'some_experience')
            )
Having db experience increases the salary in mean and median.
(so_df
.pipe(na_performance, 'DatabaseWorkedWith')
)
Having framework experience increases the salary in all statistics.
(so_df
.pipe(na_performance, 'FrameworkWorkedWith')
)
Having language experience increases the salary in all statistics (there are few surveys without language experience).
(so_df
.pipe(na_performance, 'LanguageWorkedWith')
)
Having gone to hackathons also helps.
(so_df
.pipe(na_performance, 'HackathonReasons')
)
Bootcamps seem to help too, in all statistics except p10.
(so_df
.pipe(na_performance, 'TimeAfterBootcamp')
)
I've also heard that Cobol developers make a lot of money. This supports it, although we have low statistical mass for Cobol devs:
so_df['cobol'] = so_df.LanguageWorkedWith.str.contains('Cobol')
salary_summary(so_df, 'cobol')
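One caveat worth knowing: str.contains propagates missing answers as NaN, so surveys without a language answer are silently dropped by the subsequent groupby. A minimal sketch on made-up data:

```python
import pandas as pd

# Made-up examples; None stands for a survey with no language answer
langs = pd.Series(['Cobol;SQL', 'Python', None])

# str.contains returns NaN for the missing entry, which groupby later drops
has_cobol = langs.str.contains('Cobol')
print(has_cobol.tolist())  # [True, False, nan]
```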
Let's get started with the modelling. First of all, we'll split our data in train and test sets using a simple random split.
x_df = so_df.drop(columns=['ConvertedSalary', 'Salary'])
y_df = so_df['ConvertedSalary']
x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, random_state=42)
Baselines
We're going to try 4 baselines and compare our models to these baselines:
- Use the global mean to approximate the salary (no model).
- Use the global median to approximate the salary (no model).
- Use the country median to approximate the salary (almost no model).
- Use a simple linear regression with the numeric features that we currently have.
We are going to measure the performance of the baselines on the test data using the mean absolute error (MAE) as a metric. I like the MAE because it is easier to interpret than the RMSE.
global_median = y_train.median()
global_median
global_mean = y_train.mean()
global_mean
Our two simplest baselines have MAEs of 70k and 83k dollars of annual salary. This is pretty high, and we notice that the median is a much better estimator than the mean with respect to the MAE. This makes sense, as the median minimizes the MAE.
constant_prediction = [global_median for i in y_test]
mean_absolute_error(constant_prediction, y_test)
constant_prediction_mean = [global_mean for i in y_test]
mean_absolute_error(constant_prediction_mean, y_test)
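This is easy to verify numerically on a small made-up sample (not the survey data): among constant predictions, the median minimizes the MAE while the mean minimizes the squared error:

```python
import numpy as np

# Made-up skewed sample
y = np.array([10., 20., 30., 40., 200.])
median, mean = np.median(y), np.mean(y)

mae = lambda c: np.abs(y - c).mean()
mse = lambda c: ((y - c) ** 2).mean()

# The median beats the mean on MAE...
print(mae(median) < mae(mean))  # True
# ...while the mean beats the median on squared error
print(mse(mean) < mse(median))  # True
```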
Let's build a simple model that estimates the salary of a survey using the median of its country.
simple_model = (x_train
.assign(y=y_train)
.groupby('Country')
.agg({'y': 'median'})
.reset_index()
)
simple_predictions = (x_test
.merge(simple_model, how='left')
.loc[:, 'y']
)
# If no training data of that country, just use the global median
simple_predictions = simple_predictions.fillna(global_median)
The MAE of this model is significantly lower than the other baselines (around 58k).
mean_absolute_error(simple_predictions, y_test)
The last baseline that we are going to use is a simple linear regression with the numeric variables that we have. The linear regression model performs much worse than the dummy country model. I think this can be explained by the fact that the linear model optimizes the sum of squared errors, and this doesn't optimize the MAE in any way.
pipe = Pipeline([
('selector', PandasTypeSelector(include='float64')),
('imputer', SimpleImputer(strategy='mean')),
('scaler', StandardScaler()),
('lr', LinearRegression())
])
pipe.fit(x_train, y_train)
mean_absolute_error(pipe.predict(x_test), y_test)
We'll compute some features regarding the missingness of experience:
so_df['cobol_int'] = so_df['cobol'].astype(float)
so_df['null_db'] = so_df['DatabaseWorkedWith'].isnull().astype(float)
so_df['null_fwork'] = so_df['FrameworkWorkedWith'].isnull().astype(float)
so_df['null_lang'] = so_df['LanguageWorkedWith'].isnull().astype(float)
so_df['null_hck'] = so_df['HackathonReasons'].isnull().astype(float)
so_df['null_boot'] = so_df['TimeAfterBootcamp'].isnull().astype(float)
And split the data again using the same split (random_state=42).
x_df = so_df.drop(columns=['ConvertedSalary', 'Salary'])
y_df = so_df['ConvertedSalary']
x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, random_state=42)
Now we use much better modelling techniques:
- New features as described above.
- Target encoding of categorical features.
- Lightgbm instead of linear model.
And measure the cross-validation MAE before keeping a model.
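To make target encoding concrete, here is a minimal, unsmoothed sketch on made-up data of what it does conceptually: each category is replaced by the mean of the target within that category. The actual TargetEncoder from category_encoders additionally smooths these means towards the global mean to avoid overfitting rare categories:

```python
import pandas as pd

# Made-up data: a categorical feature and a numeric target
df = pd.DataFrame({
    'Country': ['US', 'US', 'DE', 'DE', 'DE'],
    'salary':  [100., 120., 60., 70., 80.],
})

# Unsmoothed target encoding: replace each category by its group's target mean
encoding = df.groupby('Country')['salary'].mean()
df['Country_te'] = df['Country'].map(encoding)

print(df['Country_te'].tolist())  # [110.0, 110.0, 70.0, 70.0, 70.0]
```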
te = TargetEncoder(cols=['Country',
'JobSatisfaction',
'CareerSatisfaction',
'RaceEthnicity',
'CompanySize',
'OperatingSystem'
])
pipe_lgb = Pipeline([
('encoder', te),
('selector', PandasTypeSelector(include='float64')),
('imputer', SimpleImputer(strategy='mean')),
('lr', LGBMRegressor())
])
cv_lgb = cross_val_score(
pipe_lgb,
x_train,
y_train,
scoring='neg_mean_absolute_error',
cv=3
)
-cv_lgb.mean()
Disappointingly, our model is better than the linear regression, but way worse than the country-level model. It looks like all the models that use the median are way better than the models that optimize sums of squares.
Luckily, we can change the LightGBM objective from L2 to L1, so that LightGBM optimizes the MAE directly.
pipe_lgb = Pipeline([
('encoder', te),
('selector', PandasTypeSelector(include='float64')),
('imputer', SimpleImputer(strategy='mean')),
('lr', LGBMRegressor(objective='regression_l1'))
])
cv_lgb = cross_val_score(
pipe_lgb,
x_train,
y_train,
scoring='neg_mean_absolute_error',
cv=3
)
-cv_lgb.mean()
Now we're talking: the cross-validation MAE is lower than in all baselines. Let's see if this still holds on the test data.
pipe_lgb.fit(x_train, y_train)
mean_absolute_error(pipe_lgb.predict(x_test), y_test)
Indeed, the model has improved on test data. MAE has gone from 58k (our best baseline) to 53k. The improvement is not huge, and in a production environment we'd ask if this is necessary, but we now have a slightly better model and we can iterate with more features to improve it even more.
For instance, we haven't used programming language properly as a feature, and this could help the model significantly.
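As a sketch of what that could look like: LanguageWorkedWith is a multi-valued column, which appears to be semicolon-separated in the survey (an assumption here), so a first attempt would be a multi-hot encoding with one indicator column per language:

```python
import pandas as pd

# Made-up examples of the multi-valued format; ';' is the assumed delimiter
langs = pd.Series(['Python;SQL', 'Cobol', 'Python;JavaScript;SQL', None])

# One indicator column per language; missing answers get all zeros
lang_dummies = langs.str.get_dummies(sep=';')
print(list(lang_dummies.columns))  # ['Cobol', 'JavaScript', 'Python', 'SQL']
```

These indicator columns could then be concatenated to the numeric features and fed to the same LightGBM pipeline.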