One year ago I joined a hackathon with some friends. The topic of the hackathon was survival analysis. In particular, we worked on predicting the survival of melanoma patients. Survival analysis differs from other supervised learning frameworks for two reasons:

  • The output of the algorithm is a curve of probabilities that depends on time. This makes it different from the classification setting, where the output is a single probability or a set of probabilities that sum to one. The functional object to be predicted is substantially different. This curve is known as the survival curve and will be denoted $S(t)$.
  • Censoring: for many observations, we don’t have complete information (they are censored). In the medical setting, these might be patients who stopped going to the doctor, so we are not sure about their health status.

A business application of survival analysis is predicting survival curves for a customer using a product. That is, we want to estimate the probability that the customer is still using the product at a given time $t$ in the future. This makes it possible to compute customer lifetime values. Whether there is censoring or not depends on the business setting.

Brier score

The metric we were asked to optimize was the Brier score.

Let’s say we are asked to predict at times from 0 to $T$. If a patient dies at time $t'$ and the output of our model is a function $S(t)$, the Brier score for that patient is:

$$\sum_{t=0}^{t'} (1 - S(t))^2 + \sum_{t=t'}^{T} S(t)^2$$

This is roughly the squared L2 distance between the survival curve $S(t)$ and the target curve, which is 1 while the patient is alive and 0 afterwards.

For patients that survive until time $T$, the same formula is used, replacing $t'$ with $T$.

For patients that are censored, the metric is similar, but each term is multiplied by weights derived from the conditional survival function.

To compute the Brier score of a whole population, we simply average the individual Brier scores.
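As a concrete illustration, here is a minimal NumPy sketch of the Brier score for uncensored patients (the censoring weights are omitted for simplicity, and all names are made up for this example):

```python
import numpy as np

def brier_score(surv_curves, death_times, horizon):
    """Mean Brier score for uncensored patients.

    surv_curves: array of shape (n_patients, horizon + 1),
        predicted S(t) for t = 0, ..., horizon.
    death_times: array of shape (n_patients,); patients who
        survive past the horizon get death_times[i] > horizon.
    """
    times = np.arange(horizon + 1)
    # Target curve: 1 while the patient is alive, 0 afterwards.
    targets = (times[None, :] < death_times[:, None]).astype(float)
    # Mean squared L2 distance between predicted and target curves.
    return np.mean((targets - surv_curves) ** 2)

# Toy usage: two patients, horizon of 5 time steps.
S = np.tile(np.linspace(1.0, 0.4, 6), (2, 1))
print(brier_score(S, death_times=np.array([3, 10]), horizon=5))
```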

Classic methods

Kaplan-Meier

The Kaplan-Meier estimator should be the baseline for any survival analysis model development. It is the analogue of a constant model that always predicts the mean of the training observations in the regression setting.

If at time $t_i$ there are $n_i$ patients at risk (alive and not yet censored) and $d_i$ deaths, the Kaplan-Meier curve is defined as:

$$S(t) = \prod_{t_i \leq t}\left(1 - \frac{d_i}{n_i}\right)$$

Note that the estimate is a step function: if there is no death or censoring between $t_1$ and $t_2$ in the training set, then $S(t_1) = S(t_2)$.
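For a concrete baseline, here is a minimal sketch using the lifelines library; the toy durations and event flags are made up:

```python
# Kaplan-Meier baseline with lifelines (toy data for illustration).
from lifelines import KaplanMeierFitter

durations = [5, 6, 6, 2, 4, 4, 8, 9]   # follow-up time per patient
events =    [1, 0, 1, 1, 1, 0, 1, 0]   # 1 = death observed, 0 = censored

kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=events)

# Step-function estimate of S(t), constant between observed deaths.
print(kmf.survival_function_)
print(kmf.predict(7))  # S(7)
```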

Cox model

The Cox model estimates the hazard function instead of the survival function. The hazard function is defined as the instantaneous death rate at time $t$, conditional on survival up to time $t$; since the density of the death time is $f(t) = -S'(t)$ and $h(t) = f(t)/S(t)$, it is related to the survival function by the following equation:

$$h(t) = -\frac{S'(t)}{S(t)}$$

And conversely:

$$S(t) = \exp\left(-\int_0^t h(u)\,du\right)$$
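As a quick sanity check of this relationship: a constant hazard $h(t) = \lambda$ gives $S(t) = e^{-\lambda t}$, the exponential distribution. The snippet below verifies the integral formula numerically; it is purely illustrative:

```python
import numpy as np

lam = 0.3                      # constant hazard rate
t = np.linspace(0, 10, 1001)
hazard = np.full_like(t, lam)

# Cumulative hazard via the trapezoidal rule, then S(t) = exp(-H(t)).
cum_hazard = np.concatenate(
    [[0.0], np.cumsum((hazard[1:] + hazard[:-1]) / 2 * np.diff(t))]
)
S_numeric = np.exp(-cum_hazard)

# Matches the closed form exp(-lam * t) up to floating-point error.
print(np.max(np.abs(S_numeric - np.exp(-lam * t))))
```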

Given some features $(x_1,\dots,x_n)$, coefficients $(b_1,\dots,b_n)$, and a baseline hazard $h_0(t)$ shared by all individuals, the Cox model writes the hazard function as

$$h(t) = h_0(t)\exp(b_1x_1 + \dots + b_nx_n)$$

The main assumption of the Cox model is the so-called proportional hazards assumption. Intuitively, proportional hazards mean that if individual A has a higher survival probability than individual B at some time $t_1$, then A has a higher survival probability at all times.

Both the linear dependence on the features and the proportional hazards assumption make the Cox model, in general, a high-bias model.

This is the model our team used to calculate survival curves. Most of the work we did was based on feature engineering.
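For reference, here is a minimal sketch of this kind of pipeline using lifelines' CoxPHFitter; the DataFrame and column names are made up for illustration:

```python
# Cox proportional hazards pipeline with lifelines (toy data).
import pandas as pd
from lifelines import CoxPHFitter

df = pd.DataFrame({
    "age":      [50, 61, 45, 70, 58, 66],
    "stage":    [1, 3, 2, 3, 1, 2],
    "duration": [8, 2, 6, 1, 9, 4],      # follow-up time
    "event":    [0, 1, 1, 1, 0, 1],      # 1 = death observed
})

cph = CoxPHFitter()
cph.fit(df, duration_col="duration", event_col="event")

# One survival curve S(t) per patient, indexed by time.
surv = cph.predict_survival_function(df[["age", "stage"]])
print(surv.head())
```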

Metric-optimizing approach

Unfortunately, our model ended up in second position, far from first place on the leaderboard.

This section explains the winning team’s solution.

Recall that the Brier score for a set of individuals is defined as

$$BS = \frac{1}{NT} \sum_{i=1}^N \sum_{t=0}^{T} w_i(t)\,(y_i(t) - S_i(t))^2$$

where $y_i(t)$ is the survival target for that patient, equal to 1 before the death time and 0 after it, and $w_i(t)$ are the censoring weights for that patient. A nice trick is to swap the order of summation:

$$BS = \frac{1}{NT} \sum_{t=0}^T \sum_{i=1}^{N} w_i(t)\,(y_i(t) - S_i(t))^2$$

Each of the inner sums can be seen as a weighted regression problem. So $T$ independent regression models can be trained, with any algorithm or method you want, and this directly optimizes the Brier score. The Cox model, on the other hand, does not optimize the Brier score directly.
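In code, the idea might look like the following sketch: one regressor per time step, each fit with the censoring weights as sample weights. The choice of scikit-learn's GradientBoostingRegressor and all the data shapes here are assumptions for illustration; any squared-error regressor would do:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fit_per_timestep(X, y, w):
    """Fit one regressor per time step.

    X: (n_patients, n_features) feature matrix.
    y: (n_patients, T + 1) survival targets y_i(t).
    w: (n_patients, T + 1) censoring weights w_i(t).
    """
    models = []
    for t in range(y.shape[1]):
        # Squared-error loss matches the Brier score at time t.
        reg = GradientBoostingRegressor()
        reg.fit(X, y[:, t], sample_weight=w[:, t])
        models.append(reg)
    return models

def predict_curves(models, X):
    # Stack per-time predictions into curves; note that nothing
    # enforces monotonicity, so the "survival" curves can go back up.
    return np.column_stack([m.predict(X) for m in models])
```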

This trick allowed the first team to optimize the metric and produce better predictive curves. The curves had a property that would have made any statistician cry: the survival functions were non-monotonic. However, their error metric was much lower.