Understanding Survival Rates

An important caveat about understanding cancer is that treatment is not necessarily about “curing” the disease; to summarise Hank Green, a YouTuber who documented his diagnosis and treatment, of the 9000 Americans who develop Hodgkin Lymphoma, 900 die. “Boom, that’s your statistic right there, 10%, right? Nope, not really!”

Survival rates are not designed to be interpreted as simply as that; according to the CDC, such interpretations don’t take into account factors like the age of the individual at time of diagnosis, the duration of the disease, the stage of the cancer, and sophistication of available treatment and technology, and many other factors. Let’s take a look at some data to understand this better.

Patient Journeys

I found a small data set from Our World In Data that we can use to visualize how patients’ journeys are not always the same.

Fetching the Data

We’ll use the data.world package to pull directly from the site and use a short SQL query to check the tables and pull the data into a dataframe:

suppressPackageStartupMessages({
    library(tidyverse)
    library(ggrain)
    library(plotly)
    library(data.world)
})
Warning: package 'readr' was built under R version 4.2.3
Warning: package 'dplyr' was built under R version 4.2.3
Warning: package 'ggrain' was built under R version 4.2.3
Warning: package 'plotly' was built under R version 4.2.3
ds <- "https://data.world/makeovermonday/2018w40-five-year-cancer-survival-rates-in-america"

tables <- data.world::query(
    data.world::qry_sql("SELECT * FROM Tables"),
    dataset = ds
)
Rows: 1 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): tableId, tableName, tableTitle, owner, dataset
lgl (1): tableDescription

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
print(tables)
# A tibble: 1 × 6
  tableId                    tableName tableTitle tableDescription owner dataset
  <chr>                      <chr>     <chr>      <lgl>            <chr> <chr>  
1 five_year_cancer_survival… five_yea… five_year… NA               make… 2018w4…
query <- "SELECT * from five_year_cancer_survival_rates_in_usa"

five_yr <- data.world::query(
    data.world::qry_sql(query),
    dataset = ds
)
Rows: 1887 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): race, gender, cancer_type
dbl (2): survival_rate, year

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
five_yr
# A tibble: 1,887 × 5
   survival_rate  year race      gender  cancer_type
           <dbl> <dbl> <chr>     <chr>   <chr>      
 1         0.559  1977 All races females All cancers
 2         0.55   1980 All races females All cancers
 3         0.551  1983 All races females All cancers
 4         0.576  1986 All races females All cancers
 5         0.596  1989 All races females All cancers
 6         0.609  1992 All races females All cancers
 7         0.618  1995 All races females All cancers
 8         0.511  1989 All races males   All cancers
 9         0.591  1992 All races males   All cancers
10         0.608  1995 All races males   All cancers
# ℹ 1,877 more rows

From glimpsing this data, we can see that the term “survival rate” can be broken down by race, gender, and cancer type. Essentially, each row tells us what the 5-year survival rate would be that year for a person who contracts that cancer.

Let’s visualize:

five_yr %>% 
    ggplot(aes(x = 1, y = survival_rate, fill = race, colour = race)) +
    theme_minimal() +
    geom_rain(alpha = .5) +
    coord_flip() +
    theme(axis.text.y = element_blank(), axis.title.y = element_blank())+ 
    labs(title = "Survival rates of aspects of cancer over time", subtitle = "Race")
Warning: Removed 221 rows containing non-finite values
(`stat_half_ydensity()`).
Warning: Removed 221 rows containing non-finite values (`stat_boxplot()`).
Warning: Removed 221 rows containing missing values (`geom_point_sorted()`).

five_yr %>%
     ggplot(aes(x = 1, y = survival_rate, fill = gender, colour = gender)) +
    theme_minimal() +
    geom_rain(alpha = .5) +
    coord_flip() +
    theme(axis.text.y = element_blank(), axis.title.y = element_blank())+ 
    labs(title = "Survival rates of aspects of cancer over time", subtitle = "Gender")
Warning: Removed 221 rows containing non-finite values
(`stat_half_ydensity()`).
Warning: Removed 221 rows containing non-finite values (`stat_boxplot()`).
Warning: Removed 221 rows containing missing values (`geom_point_sorted()`).

five_yr %>%
    ggplot(aes(x = cancer_type, y = survival_rate)) +
    theme_minimal() +
    geom_violin() +
    theme(axis.text.x = element_text(angle = 45))+ 
    labs(title = "Survival rates of aspects of cancer over time", subtitle = "Cancer type")
Warning: Removed 221 rows containing non-finite values (`stat_ydensity()`).

These data appear to be a bit of a mish mosh, but remember that this is over time. Let’s try and put that in perspective by making something similar to the original plot.

Let’s take a look at leukemia for example:

five_yr %>%
    filter(cancer_type == "Leukemia") %>%
    ggplot(aes(x = year, y = survival_rate, colour = race)) +
    geom_line() +
    facet_grid(gender~cancer_type)
Warning: Removed 4 rows containing missing values (`geom_line()`).

This shows an upward trend overall, but there wouldn’t be enough reasonable room to plot each and every combination of variables. We should reasonably compress some of them. I’m going to assume that we wouldn’t be missing out on too much if we ignored intermittent downward trends, so maybe it would be interesting to see the change in survival rate over time?

Change in Survival Rate Over Time

five_yr %>%
    filter(cancer_type == "Leukemia") %>%
    filter(!is.na(survival_rate)) %>%
    group_by(race, gender) %>%
    summarise(survival_rate_change = last(survival_rate) - first(survival_rate))
`summarise()` has grouped output by 'race'. You can override using the
`.groups` argument.
# A tibble: 9 × 3
# Groups:   race [3]
  race      gender  survival_rate_change
  <chr>     <chr>                  <dbl>
1 All races females                0.261
2 All races males                  0.322
3 All races total                  0.296
4 Black     females                0.194
5 Black     males                  0.286
6 Black     total                  0.244
7 White     females                0.272
8 White     males                  0.324
9 White     total                  0.508

In this table, the percent of people who survive leukemia has increased by the value in the rightmost column. We can think of this as an “improvement in outcomes” measure over time. Let’s do this for all cancers:

five_yr %>%
    filter(!is.na(survival_rate)) %>%
    group_by(race, gender, cancer_type) %>%
    summarise(survival_rate_change = last(survival_rate) - first(survival_rate)) %>%
    arrange(survival_rate_change)
`summarise()` has grouped output by 'race', 'gender'. You can override using
the `.groups` argument.
# A tibble: 141 × 4
# Groups:   race, gender [9]
   race      gender  cancer_type       survival_rate_change
   <chr>     <chr>   <chr>                            <dbl>
 1 Black     females Skin                          -0.146  
 2 Black     males   Skin                          -0.0710 
 3 All races males   All cancers                   -0.0390 
 4 All races females Cervix uteri                  -0.00300
 5 All races total   Cervix uteri                  -0.00300
 6 Black     total   All cancers                    0.00200
 7 Black     females Pancreas                       0.047  
 8 All races males   Thyroid                        0.0510 
 9 All races males   Lung and bronchus              0.055  
10 All races males   Bladder                        0.0550 
# ℹ 131 more rows

This is interesting. We can now see that improvement varies by race, gender, and cancer type, with some survival rates actually decreasing over time. That’s worrying. Let’s plot this to see the full gambit:

five_yr %>%
    filter(!is.na(survival_rate)) %>%
    group_by(race, gender, cancer_type) %>%
    summarise(survival_rate_change = last(survival_rate) - first(survival_rate)) %>%
    arrange(survival_rate_change) -> five_yr_change
`summarise()` has grouped output by 'race', 'gender'. You can override using
the `.groups` argument.
pp <- five_yr_change %>%
    ggplot(aes(y = survival_rate_change, x=cancer_type, fill=gender)) +
    geom_col(position = "dodge") +
    coord_flip() +
    theme_minimal() +
    facet_grid(~race) + 
    labs(title = "Change in 5-year Survival Rate % of Different Cancers", y = "Survival Rate Change from 1977 to 2013", 
    caption = "The plot shows how the percentage of people who survive 5 years after diagnosis has changed over the last 40 years.")

ggplotly(pp)

From this chart, we can conclude that there have been improvements in almost all cancer treatments with a few exceptions — for example, skin cancer amongst black females has seen a noticeable drop in 5-year survival rates. Why would this be?

Conlcusion

This is a quick and dirty look at the data to get a sense of where we are, but overall still speaks to the original comment — understanding survival rates is complicated and all interpretations have to be looked at within their respective contexts.

Pkgs

sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 14.1.2

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] data.world_1.3.1 dwapi_0.3.2      plotly_4.10.4    ggrain_0.0.4    
 [5] lubridate_1.9.2  forcats_1.0.0    stringr_1.5.0    dplyr_1.1.4     
 [9] purrr_1.0.2      readr_2.1.5      tidyr_1.3.0      tibble_3.2.1    
[13] ggplot2_3.4.3    tidyverse_2.0.0 

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.1  xfun_0.37         colorspace_2.1-0  vctrs_0.6.5      
 [5] generics_0.1.3    htmltools_0.5.8.1 viridisLite_0.4.2 yaml_2.3.6       
 [9] utf8_1.2.4        rlang_1.1.3       pillar_1.9.0      glue_1.7.0       
[13] withr_3.0.0       bit64_4.0.5       ini_0.3.1         lifecycle_1.0.4  
[17] munsell_0.5.0     gtable_0.3.3      htmlwidgets_1.6.2 evaluate_0.22    
[21] labeling_0.4.2    knitr_1.42        tzdb_0.4.0        gghalves_0.1.4   
[25] fastmap_1.1.1     crosstalk_1.2.1   parallel_4.2.2    curl_5.2.1       
[29] fansi_1.0.6       ggpp_0.5.6        polynom_1.4-1     renv_1.0.5       
[33] scales_1.2.1      vroom_1.6.5       jsonlite_1.8.8    farver_2.1.1     
[37] bit_4.0.5         hms_1.1.3         digest_0.6.35     stringi_1.8.3    
[41] grid_4.2.2        cli_3.6.2         tools_4.2.2       magrittr_2.0.3   
[45] lazyeval_0.2.2    crayon_1.5.2      pkgconfig_2.0.3   MASS_7.3-58.1    
[49] data.table_1.14.8 timechange_0.2.0  rmarkdown_2.21    httr_1.4.7       
[53] R6_2.5.1          compiler_4.2.2