The aim of this post is to explore the Stack Overflow Developer Survey 2018 dataset and do some modelling of salary. As a bonus, we'll discuss some properties of the mean absolute error.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option('display.max_columns', None) 


from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklego.preprocessing import PandasTypeSelector
from sklearn.pipeline import Pipeline

from sklearn.metrics import mean_absolute_error, mean_squared_error

from lightgbm import LGBMRegressor

from category_encoders import TargetEncoder
so_df = pd.read_csv('stackoverflow.csv', low_memory=False)

Exploratory data analysis

Let's have a quick look at the data that we have available:

so_df.head()
Respondent Hobby OpenSource Country Student Employment FormalEducation UndergradMajor CompanySize DevType YearsCoding YearsCodingProf JobSatisfaction CareerSatisfaction HopeFiveYears JobSearchStatus LastNewJob AssessJob1 AssessJob2 AssessJob3 AssessJob4 AssessJob5 AssessJob6 AssessJob7 AssessJob8 AssessJob9 AssessJob10 AssessBenefits1 AssessBenefits2 AssessBenefits3 AssessBenefits4 AssessBenefits5 AssessBenefits6 AssessBenefits7 AssessBenefits8 AssessBenefits9 AssessBenefits10 AssessBenefits11 JobContactPriorities1 JobContactPriorities2 JobContactPriorities3 JobContactPriorities4 JobContactPriorities5 JobEmailPriorities1 JobEmailPriorities2 JobEmailPriorities3 JobEmailPriorities4 JobEmailPriorities5 JobEmailPriorities6 JobEmailPriorities7 UpdateCV Currency Salary SalaryType ConvertedSalary CurrencySymbol CommunicationTools TimeFullyProductive EducationTypes SelfTaughtTypes TimeAfterBootcamp HackathonReasons AgreeDisagree1 AgreeDisagree2 AgreeDisagree3 LanguageWorkedWith LanguageDesireNextYear DatabaseWorkedWith DatabaseDesireNextYear PlatformWorkedWith PlatformDesireNextYear FrameworkWorkedWith FrameworkDesireNextYear IDE OperatingSystem NumberMonitors Methodology VersionControl CheckInCode AdBlocker AdBlockerDisable AdBlockerReasons AdsAgreeDisagree1 AdsAgreeDisagree2 AdsAgreeDisagree3 AdsActions AdsPriorities1 AdsPriorities2 AdsPriorities3 AdsPriorities4 AdsPriorities5 AdsPriorities6 AdsPriorities7 AIDangerous AIInteresting AIResponsible AIFuture EthicsChoice EthicsReport EthicsResponsible EthicalImplications StackOverflowRecommend StackOverflowVisit StackOverflowHasAccount StackOverflowParticipate StackOverflowJobs StackOverflowDevStory StackOverflowJobsRecommend StackOverflowConsiderMember HypotheticalTools1 HypotheticalTools2 HypotheticalTools3 HypotheticalTools4 HypotheticalTools5 WakeTime HoursComputer HoursOutside SkipMeals ErgonomicDevices Exercise Gender SexualOrientation EducationParents RaceEthnicity Age Dependents MilitaryUS SurveyTooLong SurveyEasy
0 3 Yes Yes United Kingdom No Employed full-time Bachelor’s degree (BA, BS, B.Eng., etc.) A natural science (ex. biology, chemistry, phy... 10,000 or more employees Database administrator;DevOps specialist;Full-... 30 or more years 18-20 years Moderately dissatisfied Neither satisfied nor dissatisfied Working in a different or more specialized tec... I am actively looking for a job More than 4 years ago 1.0 7.0 10.0 8.0 2.0 5.0 4.0 3.0 6.0 9.0 1.0 5.0 3.0 7.0 10.0 4.0 11.0 9.0 6.0 2.0 8.0 3.0 1.0 5.0 2.0 4.0 1.0 3.0 4.0 5.0 2.0 6.0 7.0 I saw an employer’s advertisement British pounds sterling (£) 51000 Yearly 70841.0 GBP Confluence;Office / productivity suite (Micros... One to three months Taught yourself a new language, framework, or ... The official documentation and/or standards fo... NaN NaN Agree Agree Neither Agree nor Disagree JavaScript;Python;Bash/Shell Go;Python Redis;PostgreSQL;Memcached PostgreSQL Linux Linux Django React IPython / Jupyter;Sublime Text;Vim Linux-based 2 NaN Git;Subversion A few times per week Yes Yes The website I was visiting asked me to disable it Somewhat agree Neither agree nor disagree Neither agree nor disagree NaN 3.0 5.0 1.0 4.0 6.0 7.0 2.0 Increasing automation of jobs Increasing automation of jobs The developers or the people creating the AI I'm excited about the possibilities more than ... Depends on what it is Depends on what it is Upper management at the company/organization Yes 10 (Very Likely) A few times per month or weekly Yes A few times per month or weekly Yes No, I have one but it's out of date 7 Yes A little bit interested A little bit interested A little bit interested A little bit interested A little bit interested Between 6:01 - 7:00 AM 5 - 8 hours 30 - 59 minutes Never Ergonomic keyboard or mouse Daily or almost every day Male Straight or heterosexual Bachelor’s degree (BA, BS, B.Eng., etc.) White or of European descent 35 - 44 years old Yes NaN The survey was an appropriate length Somewhat easy
1 7 Yes No South Africa Yes, part-time Employed full-time Some college/university study without earning ... Computer science, computer engineering, or sof... 10,000 or more employees Data or business analyst;Desktop or enterprise... 6-8 years 0-2 years Slightly satisfied Moderately satisfied Working in a different or more specialized tec... I’m not actively looking, but I am open to new... Between 1 and 2 years ago 8.0 5.0 7.0 1.0 2.0 6.0 4.0 3.0 10.0 9.0 1.0 10.0 2.0 4.0 8.0 3.0 11.0 7.0 5.0 9.0 6.0 2.0 1.0 4.0 5.0 3.0 7.0 3.0 6.0 2.0 1.0 4.0 5.0 My job status or other personal status changed South African rands (R) 260000 Yearly 21426.0 ZAR Office / productivity suite (Microsoft Office,... Three to six months Taken a part-time in-person course in programm... The official documentation and/or standards fo... NaN NaN Strongly agree Agree Strongly disagree C;C++;Java;Matlab;R;SQL;Bash/Shell Assembly;C;C++;Matlab;SQL;Bash/Shell SQL Server;PostgreSQL;Oracle;IBM Db2 PostgreSQL;Oracle;IBM Db2 Arduino;Windows Desktop or Server Arduino;Windows Desktop or Server NaN NaN Notepad++;Visual Studio;Visual Studio Code Windows 2 Evidence-based software engineering;Formal sta... Zip file back-ups Weekly or a few times per month No NaN NaN Somewhat agree Somewhat agree Somewhat disagree Clicked on an online advertisement;Saw an onli... 2.0 3.0 4.0 6.0 1.0 7.0 5.0 Algorithms making important decisions Algorithms making important decisions The developers or the people creating the AI I'm excited about the possibilities more than ... No Yes, but only within the company Upper management at the company/organization Yes 10 (Very Likely) Daily or almost daily Yes Less than once per month or monthly No, I knew that Stack Overflow had a jobs boar... No, I know what it is but I don't have one NaN Yes Extremely interested Extremely interested Extremely interested Extremely interested Extremely interested Before 5:00 AM Over 12 hours 1 - 2 hours Never NaN 3 - 4 times per week Male Straight or heterosexual Some college/university study without earning ... White or of European descent 18 - 24 years old Yes NaN The survey was an appropriate length Somewhat easy
2 8 Yes No United Kingdom No Employed full-time Bachelor’s degree (BA, BS, B.Eng., etc.) Computer science, computer engineering, or sof... 10 to 19 employees Back-end developer;Database administrator;Fron... 6-8 years 3-5 years Moderately satisfied Slightly satisfied Working in a different or more specialized tec... I am actively looking for a job Between 2 and 4 years ago 8.0 5.0 4.0 9.0 1.0 3.0 6.0 2.0 10.0 7.0 1.0 3.0 4.0 10.0 9.0 2.0 6.0 5.0 11.0 7.0 8.0 4.0 2.0 5.0 1.0 3.0 2.0 6.0 7.0 3.0 1.0 5.0 4.0 I did not receive an expected change in compen... British pounds sterling (£) 30000 NaN 41671.0 GBP Confluence;Jira;Office / productivity suite (M... Less than a month Received on-the-job training in software devel... The official documentation and/or standards fo... NaN To improve my general technical skills or prog... Disagree Neither Agree nor Disagree Strongly disagree Java;JavaScript;Python;TypeScript;HTML;CSS C#;Go;Java;JavaScript;Python;SQL;TypeScript;HT... MongoDB PostgreSQL Linux Linux Angular;Node.js Node.js IntelliJ;PyCharm;Visual Studio Code Linux-based 2 Agile Git A few times per week Yes Yes I wanted to support the website I was visiting... Somewhat agree Somewhat agree Somewhat disagree Saw an online advertisement and then researche... 1.0 3.0 4.0 2.0 7.0 5.0 6.0 Increasing automation of jobs Algorithms making important decisions A governmental or other regulatory body I'm excited about the possibilities more than ... Depends on what it is Depends on what it is Upper management at the company/organization Unsure / I don't know 7 A few times per month or weekly Yes Less than once per month or monthly Yes No, I have one but it's out of date 8 No A little bit interested Not at all interested Very interested Very interested Extremely interested Between 7:01 - 8:00 AM 9 - 12 hours 30 - 59 minutes 1 - 2 times per week NaN 1 - 2 times per week Male Straight or heterosexual Secondary school (e.g. American high school, G... White or of European descent 18 - 24 years old No NaN The survey was an appropriate length Somewhat easy
3 9 Yes Yes United States No Employed full-time Some college/university study without earning ... Computer science, computer engineering, or sof... 10,000 or more employees Back-end developer;Front-end developer;Full-st... 9-11 years 0-2 years Slightly satisfied Moderately satisfied Working as a founder or co-founder of my own c... I’m not actively looking, but I am open to new... Less than a year ago 5.0 3.0 9.0 4.0 1.0 8.0 2.0 7.0 10.0 6.0 1.0 3.0 2.0 9.0 11.0 4.0 8.0 6.0 7.0 10.0 5.0 3.0 1.0 5.0 4.0 2.0 1.0 5.0 3.0 4.0 2.0 6.0 7.0 My job status or other personal status changed U.S. dollars ($) 120000 Yearly 120000.0 USD Confluence;Office / productivity suite (Micros... Six to nine months Received on-the-job training in software devel... The official documentation and/or standards fo... NaN NaN Disagree Agree Strongly disagree JavaScript;HTML;CSS C;Go;JavaScript;Python;HTML;CSS MongoDB NaN Linux Linux Node.js;React React;TensorFlow Atom;Visual Studio Code MacOS 2 Agile;Scrum Git Multiple times per day Yes Yes The ad-blocking software was causing display i... Somewhat disagree Neither agree nor disagree Somewhat disagree Clicked on an online advertisement;Stopped goi... 1.0 4.0 2.0 5.0 3.0 7.0 6.0 Algorithms making important decisions Artificial intelligence surpassing human intel... The developers or the people creating the AI I'm excited about the possibilities more than ... Depends on what it is Yes, but only within the company Upper management at the company/organization Yes 10 (Very Likely) Multiple times per day Yes I have never participated in Q&A on Stack Over... Yes No, I have one but it's out of date 7 No Very interested A little bit interested Extremely interested Very interested Very interested Between 9:01 - 10:00 AM Over 12 hours Less than 30 minutes 1 - 2 times per week NaN I don't typically exercise Male Straight or heterosexual Master’s degree (MA, MS, M.Eng., MBA, etc.) White or of European descent 18 - 24 years old No No The survey was an appropriate length Somewhat easy
4 11 Yes Yes United States No Employed full-time Some college/university study without earning ... Fine arts or performing arts (ex. graphic desi... 100 to 499 employees Back-end developer;C-suite executive (CEO, CTO... 30 or more years 21-23 years Moderately satisfied Moderately satisfied Doing the same work I’m not actively looking, but I am open to new... Between 2 and 4 years ago 6.0 3.0 7.0 4.0 1.0 5.0 10.0 8.0 9.0 2.0 1.0 3.0 2.0 9.0 11.0 5.0 8.0 4.0 10.0 7.0 6.0 5.0 1.0 2.0 3.0 4.0 3.0 7.0 2.0 4.0 1.0 6.0 5.0 My job status or other personal status changed U.S. dollars ($) 250000 Yearly 250000.0 USD Confluence;HipChat;Jira;Office / productivity ... Three to six months Taken an online course in programming or softw... The official documentation and/or standards fo... NaN Because I find it enjoyable Strongly agree Strongly disagree Strongly disagree Assembly;CoffeeScript;Erlang;Go;JavaScript;Lua... Erlang;Go;Python;Rust;SQL Redis;PostgreSQL;Amazon DynamoDB;Apache HBase;... Redis;PostgreSQL;Amazon DynamoDB;Apache Hive;A... Amazon Echo;AWS;iOS;Linux;Mac OS;Serverless AWS;Linux;Mac OS;Serverless Hadoop;Node.js;React;Spark NaN IntelliJ;PyCharm;Sublime Text;Vim MacOS 1 Agile;Evidence-based software engineering;Extr... Git Multiple times per day No NaN NaN Neither agree nor disagree Somewhat agree Neither agree nor disagree Clicked on an online advertisement;Saw an onli... 1.0 3.0 5.0 4.0 2.0 7.0 6.0 Algorithms making important decisions Artificial intelligence surpassing human intel... The developers or the people creating the AI I'm worried about the dangers more than I'm ex... No Yes, and publicly The person who came up with the idea Yes 7 A few times per month or weekly Yes Less than once per month or monthly Yes Yes 7 No Not at all interested Not at all interested Not at all interested Not at all interested Not at all interested Between 8:01 - 9:00 AM 9 - 12 hours Less than 30 minutes 1 - 2 times per week Standing desk;Fatigue-relieving floor mat I don't typically exercise Male Straight or heterosexual Some college/university study without earning ... White or of European descent 35 - 44 years old Yes No The survey was an appropriate length Very easy

We can see that most columns are non-numeric. Among the few numeric columns are the AssessJob* ones. Looking at the data schema, we can see that respondents rank several aspects of a prospective job from 1 to 10, where 1 is the most important and 10 the least important. In the case of AssessJob5, the aspect being ranked is "The compensation and benefits offered". A natural question to ask is whether people who value compensation more actually make more. The answer is in the following query:

(so_df
 .groupby('AssessJob5')
 .agg({'ConvertedSalary': ['mean', 'median', len]})
)
ConvertedSalary
mean median len
AssessJob5
1.0 114636.509475 67512.0 9235.0
2.0 104268.116033 63257.0 8782.0
3.0 94250.374226 58340.0 7263.0
4.0 89597.748557 52872.0 5719.0
5.0 89341.969460 51408.0 4355.0
6.0 81042.514268 45420.0 3259.0
7.0 88240.893902 41124.0 2460.0
8.0 74418.464066 37572.0 1948.0
9.0 80024.848718 35178.0 1560.0
10.0 66614.714586 28800.0 953.0

Indeed, people who rank compensation as more important do make more, both in terms of the mean and the median.

Before asking and attempting to answer more questions regarding the salary, let's make some utility functions that will be useful.

def p10(x: pd.Series) -> float:
    """Function that returns the 10th percentile of a pandas Series"""
    return x.quantile(.10)

def p90(x: pd.Series) -> float:
    """Function that returns the 90th percentile of a pandas Series"""
    return x.quantile(.90)

def salary_summary(so_df: pd.DataFrame, var: str) -> pd.DataFrame:
    """Summaries several statistics of the salary grouping by a variable"""
    return (so_df
            .groupby(var)
            .agg({'ConvertedSalary': ['min', p10, 'median', p90, 'mean', 'max', len]})
           )

salary_summary computes some salary statistics with respect to a grouping variable. Let's use YearsCoding as that variable. We can see that more years of coding generally imply higher salaries.

salary_summary(so_df, 'YearsCoding')
ConvertedSalary
min p10 median p90 mean max len
YearsCoding
0-2 years 0.0 1200.0 13212.0 76119.1 51609.611982 2000000.0 3188.0
12-14 years 0.0 23094.3 67313.5 148801.9 109875.912365 2000000.0 4998.0
15-17 years 0.0 31300.6 76626.5 165000.0 124343.801985 2000000.0 3828.0
18-20 years 0.0 35244.0 85000.0 175000.0 128760.426768 2000000.0 3168.0
21-23 years 0.0 41229.0 95552.0 175000.0 133128.075183 2000000.0 1636.0
24-26 years 0.0 38798.4 96000.0 200000.0 145374.963762 2000000.0 1159.0
27-29 years 0.0 41671.0 96000.0 200000.0 153732.023772 2000000.0 631.0
3-5 years 0.0 4128.0 29376.0 96000.0 66707.942685 2000000.0 9788.0
30 or more years 0.0 48616.0 105000.0 205586.2 155413.456875 2000000.0 2029.0
6-8 years 0.0 7725.9 44287.0 115500.0 83373.084229 2000000.0 10234.0
9-11 years 0.0 14662.5 55392.0 130000.0 94310.455866 2000000.0 7024.0

We can actually attempt to plot the evolution of the median salary in terms of years spent programming:

so_df['years_coding_int'] = so_df.YearsCoding.map({
    '30 or more years': 35, 
    '6-8 years': 7, 
    '9-11 years': 10, 
    '0-2 years': 1,
    '15-17 years': 16,
    '18-20 years': 19,
    '3-5 years': 4,
    '12-14 years': 13,
    '24-26 years': 25, 
    '21-23 years': 22, 
    '27-29 years': 28
})
year_summary = salary_summary(so_df, 'years_coding_int')
sns.lineplot(x=year_summary.index, y=year_summary.ConvertedSalary['median'])
(Line plot of the median ConvertedSalary against years_coding_int.)

A somewhat expected result: the median salary increases quickly at first and slows down after some years. I did expect the saturation to happen earlier, but apparently I was wrong.

Getting to the median salary also meant wading through the multi-level columns, so we'll create a helper function that flattens them into a tidier structure.

def flatten_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Tidy the structure of the output of salary_summary"""
    df.columns = ["_".join(col) for col in df.columns.ravel()]
    return df

Country evaluation

Let's see in which countries you can make the most money in terms of the median salary. We'll use the flatten_columns function to make sorting easier:

flatten_columns(salary_summary(so_df, 'Country')).\
sort_values('ConvertedSalary_median', ascending=False)
ConvertedSalary_min ConvertedSalary_p10 ConvertedSalary_median ConvertedSalary_p90 ConvertedSalary_mean ConvertedSalary_max ConvertedSalary_len
Country
Andorra 48955.0 49689.4 525701.5 1000000.0 525089.500000 1000000.0 4.0
Botswana 140000.0 146000.0 170000.0 194000.0 170000.000000 200000.0 2.0
Saint Lucia 159852.0 159852.0 159852.0 159852.0 159852.000000 159852.0 1.0
Togo 24000.0 51095.5 159477.5 267859.5 159477.500000 294955.0 2.0
Sierra Leone 120000.0 120000.0 120000.0 120000.0 120000.000000 120000.0 1.0
... ... ... ... ... ... ... ...
Swaziland 1488.0 1586.4 1980.0 3564.0 2476.000000 3960.0 3.0
Cuba 156.0 794.4 1944.0 58800.0 62919.368421 1000000.0 19.0
Zambia 1824.0 1824.0 1824.0 1824.0 1824.000000 1824.0 1.0
Dominica 1000.0 1000.0 1000.0 1000.0 1000.000000 1000.0 1.0
Monaco 0.0 0.0 0.0 0.0 0.000000 0.0 1.0

157 rows × 7 columns

Let's stop for a moment and analyse the issues of what we are doing:

  • The code starts to look messy; it's hard to read and reason about. We'll use cleaner code from now on.
  • I'd never say that Togo is a top-earning country. The issue is that the top countries are the ones with low statistical mass (the last column is the number of surveys answered in that country). We'll keep only the countries with more than x surveys to answer this question properly.

To write cleaner code, we can use pandas pipes. This was a big lesson for me: now I'm able to write dplyr-like code in pandas.
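
As a tiny illustration of what .pipe does (this is just the pandas semantics, not something from the original analysis): df.pipe(func, *args) simply calls func(df, *args), which is what lets us chain transformations dplyr-style.

# These two are equivalent; .pipe just forwards the dataframe as the first argument
summary_a = salary_summary(so_df, 'Country')
summary_b = so_df.pipe(salary_summary, 'Country')
assert summary_a.equals(summary_b)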

To solve the issue of low statistical mass, we only consider countries with more than 40 completed surveys. The results make more sense from a macroeconomic perspective, with the US and Switzerland being the top-earning countries:

(so_df
 .pipe(salary_summary, 'Country')
 .pipe(flatten_columns)
 .query('ConvertedSalary_len > 40')
 .sort_values('ConvertedSalary_median', ascending=False)
 .head(10)
)
ConvertedSalary_min ConvertedSalary_p10 ConvertedSalary_median ConvertedSalary_p90 ConvertedSalary_mean ConvertedSalary_max ConvertedSalary_len
Country
United States 0.0 50000.0 100000.0 175000.0 154413.128506 2000000.0 12941.0
Switzerland 0.0 20938.4 93842.0 162451.6 138111.567812 1000000.0 553.0
Israel 0.0 35112.0 90642.0 140392.4 99560.527149 1248000.0 442.0
Norway 0.0 53152.0 82890.0 183633.2 150544.824926 1000000.0 337.0
Denmark 0.0 37640.0 82836.0 139878.0 100477.157303 1000000.0 356.0
Australia 0.0 41506.0 79973.0 149950.0 126623.720339 1000000.0 1180.0
Ireland 0.0 35492.0 74657.0 734328.0 186313.216718 1000000.0 323.0
New Zealand 0.0 33619.0 65776.0 166632.0 133835.951768 1000000.0 311.0
Canada 0.0 32209.0 64417.0 112730.0 100894.343419 2000000.0 1983.0
United Kingdom 0.0 31253.0 62507.0 287155.2 128216.930944 1200000.0 3794.0

We can also measure salary inequality in a given country. A common approach is to divide the 90th percentile by the 10th percentile. The bigger this ratio, the higher the inequality. Venezuela, Viet Nam and Nigeria seem to be the most unequal countries:

(so_df
 .pipe(salary_summary, 'Country')
 .pipe(flatten_columns)
 .assign(inequality=lambda x: x['ConvertedSalary_p90'] / x['ConvertedSalary_p10'])
 .query('ConvertedSalary_len > 40')
 .sort_values('inequality', ascending=False)
 .head(10)
)
ConvertedSalary_min ConvertedSalary_p10 ConvertedSalary_median ConvertedSalary_p90 ConvertedSalary_mean ConvertedSalary_max ConvertedSalary_len inequality
Country
Venezuela, Bolivarian Republic of... 24.0 444.0 8100.0 1000000.0 241823.625000 1000000.0 48.0 2252.252252
Viet Nam 0.0 792.0 9516.0 38400.0 19001.617284 285744.0 81.0 48.484848
Nigeria 0.0 829.0 6000.0 39196.8 23543.699248 1000000.0 133.0 47.282027
Iran, Islamic Republic of... 0.0 660.0 10000.0 25509.6 15644.191419 1000000.0 303.0 38.650909
Other Country (Not Listed Above) 0.0 6273.6 35274.5 162512.0 94657.119048 1056000.0 42.0 25.904106
Kenya 0.0 1922.4 12000.0 44400.0 30356.666667 360000.0 57.0 23.096130
Morocco 0.0 2592.0 12000.0 54046.4 20565.981132 100000.0 53.0 20.851235
India 0.0 2256.0 9396.0 46963.0 27694.121486 2000000.0 4091.0 20.816933
Ireland 0.0 35492.0 74657.0 734328.0 186313.216718 1000000.0 323.0 20.689958
Italy 0.0 20235.6 39164.0 327506.4 95691.657754 1000000.0 748.0 16.184665

Skills evaluation

Some columns contain information about past experience. For instance, we have DatabaseWorkedWith; whether it is null or not indicates if the respondent has experience working with databases. We're going to look at several of these experience questions to create features based on their missingness.

def na_performance(so_df: pd.DataFrame, var: str) -> pd.DataFrame:
    """Summarizes the salary statistics grouping by a variable being empty"""
    
    return (so_df
            .assign(some_experience = lambda x: ~x[var].isnull())
            .pipe(salary_summary, 'some_experience')
    )

Having db experience increases the salary in mean and median.

(so_df
 .pipe(na_performance, 'DatabaseWorkedWith')
)
ConvertedSalary
min p10 median p90 mean max len
some_experience
False 0.0 5640.0 53851.0 146198.8 93548.955018 2000000.0 7114.0
True 0.0 7884.0 55447.5 140764.0 96172.055854 2000000.0 40588.0

Having framework experience increases the salary in all statistics.

(so_df
 .pipe(na_performance, 'FrameworkWorkedWith')
)
ConvertedSalary
min p10 median p90 mean max len
some_experience
False 0.0 5640.0 50000.0 138660.0 88335.792343 2000000.0 15097.0
True 0.0 9012.0 57912.0 145000.0 99228.131023 2000000.0 32605.0

Having language experience increases the salary in all statistics (there are few surveys without language experience).

(so_df
 .pipe(na_performance, 'LanguageWorkedWith')
)
ConvertedSalary
min p10 median p90 mean max len
some_experience
False 0.0 3534.0 35244.0 125001.4 81521.755738 2000000.0 1220.0
True 0.0 7827.0 55562.0 142360.8 96155.116540 2000000.0 46482.0

Having gone to hackathons also helps.

(so_df
 .pipe(na_performance, 'HackathonReasons')
)
ConvertedSalary
min p10 median p90 mean max len
some_experience
False 0.0 7085.6 52872.0 140000.0 94593.563503 2000000.0 29597.0
True 0.0 8768.4 59172.0 146868.0 97721.787904 2000000.0 18105.0

Bootcamps seem to help too, in all statistics except p10.

(so_df
 .pipe(na_performance, 'TimeAfterBootcamp')
)
ConvertedSalary
min p10 median p90 mean max len
some_experience
False 0.0 7615.2 55075.0 141000.0 95649.656133 2000000.0 43194.0
True 0.0 7015.2 60000.0 145000.0 97038.026176 2000000.0 4508.0

I've also heard that Cobol developers make a lot of money. The data supports this, although we have low statistical mass for Cobol devs:

so_df['cobol'] = so_df.LanguageWorkedWith.str.contains('Cobol')
salary_summary(so_df, 'cobol')
ConvertedSalary
min p10 median p90 mean max len
cobol
False 0.0 7827.0 55562.0 142000.0 95957.630165 2000000.0 46180.0
True 0.0 8078.1 58749.0 178830.0 126353.529801 2000000.0 302.0

Modelling

Let's get started with the modelling. First of all, we'll split our data into train and test sets using a simple random split.

x_df = so_df.drop(columns=['ConvertedSalary', 'Salary'])
y_df = so_df['ConvertedSalary']

x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, random_state=42)

Baselines

We're going to try 4 baselines and compare our models to these baselines:

  • Use the global mean to approximate the salary (no model).
  • Use the global median to approximate the salary (no model).
  • Use the country median to approximate the salary (almost no model).
  • Use a simple linear regression with the numeric features that we currently have.

We are going to measure the performance of the baselines based on the test data using the mean absolute error as a metric. I like to use the mean absolute error as it is easier to interpret than the RMSE.

global_median = y_train.median()
global_median
55075.0
global_mean = y_train.mean()
global_mean
95646.96673747763

Our two dumbest baselines have an MAE of roughly 70k and 83k dollars of annual salary. This is pretty high, and we notice that the median is a much better estimator than the mean with respect to the MAE. This makes sense, as the median minimizes the MAE.

constant_prediction = [global_median for i in y_test]
mean_absolute_error(constant_prediction, y_test)
70951.85552574208
constant_prediction_mean = [global_mean for i in y_test]
mean_absolute_error(constant_prediction_mean, y_test)
82711.68644191355
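
As a quick sanity check of that last claim (an illustrative sketch, not part of the original analysis), we can compare constant predictions on some synthetic, salary-like data: the constant equal to the median gives a lower MAE than the constant equal to the mean.

# Illustration only: among constant predictions, the median yields a lower
# mean absolute error than the mean, especially on skewed data.
np.random.seed(0)
y_demo = np.random.lognormal(mean=10, sigma=1, size=10_000)  # skewed, salary-like values

mae_at_median = np.abs(y_demo - np.median(y_demo)).mean()
mae_at_mean = np.abs(y_demo - y_demo.mean()).mean()
mae_at_median, mae_at_mean  # the median-based constant wins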

Let's build a simple model that estimates the salary of a survey using the median of its country.

simple_model = (x_train
 .assign(y=y_train)
 .groupby('Country')
 .agg({'y': 'median'})
 .reset_index()
)
simple_predictions = (x_test
 .merge(simple_model, how='left')
 .loc[:, 'y']
)
# If no training data of that country, just use the global median
simple_predictions[simple_predictions.isnull()] = global_median

The MAE of this model is significantly lower than the other baselines (around 58k).

mean_absolute_error(simple_predictions, y_test)
58623.49387892001

The last baseline that we are going to use is a simple linear regression with the numeric variables that we have. It performs much worse than the dummy country model. I think this can be explained by the fact that the linear model optimizes the sum of squared errors, which doesn't optimize the MAE in any way.

pipe = Pipeline([
    ('selector', PandasTypeSelector(include='float64')),
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('lr', LinearRegression())
])

pipe.fit(x_train, y_train)

mean_absolute_error(pipe.predict(x_test), y_test)
76507.00916447907
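
As a side note (a sketch, not one of the baselines actually evaluated here), the linear analogue of optimizing the MAE is least absolute deviation regression, i.e. fitting the conditional median. Assuming a recent scikit-learn that provides QuantileRegressor, it would slot into the same pipeline like this:

from sklearn.linear_model import QuantileRegressor  # needs scikit-learn >= 1.0

pipe_l1 = Pipeline([
    ('selector', PandasTypeSelector(include='float64')),
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    # quantile=0.5 fits the conditional median, which targets the MAE directly
    ('lad', QuantileRegressor(quantile=0.5, alpha=0.0)),
])
# pipe_l1.fit(x_train, y_train) solves a linear program, which can be slow on
# ~35k rows, so we leave it out of the comparison.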

Improve the baseline

We'll compute some features regarding the missingness of experience:

so_df['cobol_int'] = so_df['cobol'].astype(float)
so_df['null_db'] = so_df['DatabaseWorkedWith'].isnull().astype(float)
so_df['null_fwork'] = so_df['FrameworkWorkedWith'].isnull().astype(float)
so_df['null_lang'] = so_df['LanguageWorkedWith'].isnull().astype(float)
so_df['null_hck'] = so_df['HackathonReasons'].isnull().astype(float)
so_df['null_boot'] = so_df['TimeAfterBootcamp'].isnull().astype(float)

And split the data again using the same split (random_state = 42).

x_df = so_df.drop(columns=['ConvertedSalary', 'Salary'])
y_df = so_df['ConvertedSalary']
x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, random_state=42)

Now we use much better modelling techniques:

  • New features as described above.
  • Target encoding of categorical features.
  • LightGBM instead of a linear model.

And measure the cross-validation MAE before keeping a model.
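
For intuition about what the target encoder does (an illustrative snippet, not part of the pipeline below): it replaces each category with a smoothed mean of the target for that category, so 'Country', for example, becomes a numeric column close to the average salary observed per country in the training data.

demo_encoder = TargetEncoder(cols=['Country'])
demo_encoder.fit(x_train[['Country']], y_train)
demo_encoder.transform(x_train[['Country']].head())  # country names replaced by salary-like numbers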

te = TargetEncoder(cols=['Country', 
                         'JobSatisfaction', 
                         'CareerSatisfaction', 
                         'RaceEthnicity', 
                         'CompanySize', 
                         'OperatingSystem'
                        ])
pipe_lgb = Pipeline([
    ('encoder', te),
    ('selector', PandasTypeSelector(include='float64')),
    ('imputer', SimpleImputer(strategy='mean')),
    ('lr', LGBMRegressor())
])
cv_lgb = cross_val_score(
    pipe_lgb, 
    x_train, 
    y_train, 
    scoring='neg_mean_absolute_error', 
    cv=3
)

- cv_lgb.mean()
74719.25784492154

Disappointingly, our model is better than the linear regression, but way worse than the country-level model. It looks like all the models that use the median are way better than the models that optimize sums of squares.

Luckily, we can change the LightGBM objective from L2 to L1, so that LightGBM optimizes the MAE directly.

pipe_lgb = Pipeline([
    ('encoder', te),
    ('selector', PandasTypeSelector(include='float64')),
    ('imputer', SimpleImputer(strategy='mean')),
    ('lr', LGBMRegressor(objective='regression_l1'))
])
cv_lgb = cross_val_score(
    pipe_lgb, 
    x_train, 
    y_train, 
    scoring='neg_mean_absolute_error', 
    cv=3
)

- cv_lgb.mean()
53086.352945243685

Now we're talking: the cross-validation MAE is lower than in all the baselines. Let's see if this still holds on the test data.

pipe_lgb.fit(x_train, y_train)
Pipeline(steps=[('encoder',
                 TargetEncoder(cols=['Country', 'JobSatisfaction',
                                     'CareerSatisfaction', 'RaceEthnicity',
                                     'CompanySize', 'OperatingSystem'])),
                ('selector', PandasTypeSelector(include='float64')),
                ('imputer', SimpleImputer()),
                ('lr', LGBMRegressor(objective='regression_l1'))])
mean_absolute_error(pipe_lgb.predict(x_test), y_test)
53234.00734285567

Indeed, the model has improved on the test data: the MAE has gone from 58k (our best baseline) to 53k. The improvement is not huge, and in a production environment we'd ask whether this is necessary, but we now have a slightly better model and we can iterate with more features to improve it further.

For instance, we haven't properly used the programming languages as a feature yet, and this could help the model significantly.
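
As a hedged sketch of what that could look like (hypothetical feature engineering, not something evaluated in this post), the semicolon-separated LanguageWorkedWith column can be expanded into one binary column per language:

language_dummies = (so_df['LanguageWorkedWith']
                    .str.get_dummies(sep=';')   # one 0/1 column per language
                    .add_prefix('lang_')        # e.g. lang_Python, lang_JavaScript
                    .astype(float))
so_df = pd.concat([so_df, language_dummies], axis=1)
# After re-splitting, these float columns would be picked up by the
# PandasTypeSelector step of the pipeline, just like the null_* features.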

Summary

  • Pandas pipes allow for way cleaner code.
  • When optimizing the MAE, L1-like methods (least absolute deviation regression, the L1 objective in LightGBM, median summaries) might make more sense than standard least-squares methods.