Simple Lung Cancer Classifier

Introduction

Machine learning is an effective tool in oncology. By training models on simple phenotypic characteristics, it's possible to build powerful classifiers for predicting whether an individual may have cancer. Such classifiers can serve as early-detection mechanisms: from a small set of phenotypic descriptions, doctors can quickly estimate the probability that a cancer is the root cause of someone's illness, saving time in triage.

Let’s build a quick classifier for lung cancer based on an example Kaggle dataset.

suppressPackageStartupMessages(
    {
    library(tidyverse)
    library(tidymodels)
    library(skimr)
    library(themis)
    }
)

Let’s look at the data:

url <- "https://storage.googleapis.com/kagglesdsdata/datasets/1623385/2668247/survey%20lung%20cancer.csv?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20240506%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20240506T191835Z&X-Goog-Expires=259200&X-Goog-SignedHeaders=host&X-Goog-Signature=2d244357131a33758f10bf7f3a83f5584f02ba69ab6ef5e31cb5f7c9cc3ea73922db7cb03bb47b5345455b27f4c25f4434be415327d7817b954933fe86aefc50b07fb4e6a0736238366f4a922a91615222be63de02ab687a9759eace7c59e62468deb656a3abafea85a93f0fd77fac708c2d37bece32da9669b5add662affdcfa40c73b9a8413d9039f207492544aba4ddd2f13baaa10a4b92327ea52db9322797db7113eeb00deb5ed073ddde2c58915c7b94710a2bec1dc03e4950fb20234837f03b0d4a7d5bbbbcde8c98eb2c1e7355acd3c179adf9f66d626c27ee70b46828f7d1e0b8222e446296fc3d813b8dfd42219bc6887266bf6bcc9c2ae2750aaa"
df <- read_csv(url)
Rows: 309 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): GENDER, LUNG_CANCER
dbl (14): AGE, SMOKING, YELLOW_FINGERS, ANXIETY, PEER_PRESSURE, CHRONIC DISE...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skim(df)
Data summary
Name df
Number of rows 309
Number of columns 16
_______________________
Column type frequency:
character 2
numeric 14
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
GENDER 0 1 1 1 0 2 0
LUNG_CANCER 0 1 2 3 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
AGE 0 1 62.67 8.21 21 57 62 69 87 ▁▁▆▇▂
SMOKING 0 1 1.56 0.50 1 1 2 2 2 ▆▁▁▁▇
YELLOW_FINGERS 0 1 1.57 0.50 1 1 2 2 2 ▆▁▁▁▇
ANXIETY 0 1 1.50 0.50 1 1 1 2 2 ▇▁▁▁▇
PEER_PRESSURE 0 1 1.50 0.50 1 1 2 2 2 ▇▁▁▁▇
CHRONIC DISEASE 0 1 1.50 0.50 1 1 2 2 2 ▇▁▁▁▇
FATIGUE 0 1 1.67 0.47 1 1 2 2 2 ▃▁▁▁▇
ALLERGY 0 1 1.56 0.50 1 1 2 2 2 ▆▁▁▁▇
WHEEZING 0 1 1.56 0.50 1 1 2 2 2 ▆▁▁▁▇
ALCOHOL CONSUMING 0 1 1.56 0.50 1 1 2 2 2 ▆▁▁▁▇
COUGHING 0 1 1.58 0.49 1 1 2 2 2 ▆▁▁▁▇
SHORTNESS OF BREATH 0 1 1.64 0.48 1 1 2 2 2 ▅▁▁▁▇
SWALLOWING DIFFICULTY 0 1 1.47 0.50 1 1 1 2 2 ▇▁▁▁▇
CHEST PAIN 0 1 1.56 0.50 1 1 2 2 2 ▆▁▁▁▇

From this, we understand that almost all of the variables are binary yes/no questions encoded as 1/2, with the exception of age. Let's see the class distribution:

df %>%
    ggplot(aes(x=LUNG_CANCER)) +
    geom_bar()

This is clearly imbalanced data: there are far more positive cases of lung cancer than negative ones. We'll keep this in mind for later. For now, let's see if there are any noticeable distribution differences between the two classes:

df %>%
    group_by(LUNG_CANCER) %>%
    skim()
Data summary
Name Piped data
Number of rows 309
Number of columns 16
_______________________
Column type frequency:
character 1
numeric 14
________________________
Group variables LUNG_CANCER

Variable type: character

skim_variable LUNG_CANCER n_missing complete_rate min max empty n_unique whitespace
GENDER NO 0 1 1 1 0 2 0
GENDER YES 0 1 1 1 0 2 0

Variable type: numeric

skim_variable LUNG_CANCER n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
AGE NO 0 1 60.74 9.63 21 57 61.0 65.5 87 ▁▁▇▇▁
AGE YES 0 1 62.95 7.97 38 58 62.5 69.0 81 ▁▃▇▇▂
SMOKING NO 0 1 1.49 0.51 1 1 1.0 2.0 2 ▇▁▁▁▇
SMOKING YES 0 1 1.57 0.50 1 1 2.0 2.0 2 ▆▁▁▁▇
YELLOW_FINGERS NO 0 1 1.33 0.48 1 1 1.0 2.0 2 ▇▁▁▁▃
YELLOW_FINGERS YES 0 1 1.60 0.49 1 1 2.0 2.0 2 ▅▁▁▁▇
ANXIETY NO 0 1 1.31 0.47 1 1 1.0 2.0 2 ▇▁▁▁▃
ANXIETY YES 0 1 1.53 0.50 1 1 2.0 2.0 2 ▇▁▁▁▇
PEER_PRESSURE NO 0 1 1.26 0.44 1 1 1.0 1.5 2 ▇▁▁▁▃
PEER_PRESSURE YES 0 1 1.54 0.50 1 1 2.0 2.0 2 ▇▁▁▁▇
CHRONIC DISEASE NO 0 1 1.36 0.49 1 1 1.0 2.0 2 ▇▁▁▁▅
CHRONIC DISEASE YES 0 1 1.53 0.50 1 1 2.0 2.0 2 ▇▁▁▁▇
FATIGUE NO 0 1 1.49 0.51 1 1 1.0 2.0 2 ▇▁▁▁▇
FATIGUE YES 0 1 1.70 0.46 1 1 2.0 2.0 2 ▃▁▁▁▇
ALLERGY NO 0 1 1.13 0.34 1 1 1.0 1.0 2 ▇▁▁▁▁
ALLERGY YES 0 1 1.62 0.49 1 1 2.0 2.0 2 ▅▁▁▁▇
WHEEZING NO 0 1 1.23 0.43 1 1 1.0 1.0 2 ▇▁▁▁▂
WHEEZING YES 0 1 1.60 0.49 1 1 2.0 2.0 2 ▅▁▁▁▇
ALCOHOL CONSUMING NO 0 1 1.18 0.39 1 1 1.0 1.0 2 ▇▁▁▁▂
ALCOHOL CONSUMING YES 0 1 1.61 0.49 1 1 2.0 2.0 2 ▅▁▁▁▇
COUGHING NO 0 1 1.26 0.44 1 1 1.0 1.5 2 ▇▁▁▁▃
COUGHING YES 0 1 1.63 0.48 1 1 2.0 2.0 2 ▅▁▁▁▇
SHORTNESS OF BREATH NO 0 1 1.56 0.50 1 1 2.0 2.0 2 ▆▁▁▁▇
SHORTNESS OF BREATH YES 0 1 1.65 0.48 1 1 2.0 2.0 2 ▅▁▁▁▇
SWALLOWING DIFFICULTY NO 0 1 1.13 0.34 1 1 1.0 1.0 2 ▇▁▁▁▁
SWALLOWING DIFFICULTY YES 0 1 1.52 0.50 1 1 2.0 2.0 2 ▇▁▁▁▇
CHEST PAIN NO 0 1 1.31 0.47 1 1 1.0 2.0 2 ▇▁▁▁▃
CHEST PAIN YES 0 1 1.59 0.49 1 1 2.0 2.0 2 ▆▁▁▁▇

It looks like fatigue, yellow fingers, allergies, alcohol consumption, and swallowing difficulty are likely to be particularly discriminant in differentiating the diagnoses, because their histograms show a noticeable imbalance between the two classes.

First Model

Because this dataset consists mostly of binary features, a decision-tree-based model is a natural fit. Additionally, we're going to hold out a separate validation set for model selection so that we don't leak information from the test set.
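The split below uses rsample's initial_validation_split(); the underlying idea, splitting within each class so the class ratio is preserved, can be sketched in base R (an illustrative toy, not rsample's implementation):

```r
set.seed(42)
y <- rep(c("YES", "NO"), times = c(80, 20))  # imbalanced toy labels

# 60/20/20 split of the indices within one class
split_idx <- function(idx, prop = c(0.6, 0.2)) {
  idx  <- sample(idx)                    # shuffle within the class
  cuts <- round(cumsum(prop) * length(idx))
  list(train = idx[seq_len(cuts[1])],
       val   = idx[(cuts[1] + 1):cuts[2]],
       test  = idx[(cuts[2] + 1):length(idx)])
}

# Apply the split per class, then pool the training indices
parts    <- lapply(split(seq_along(y), y), split_idx)
train_id <- unlist(lapply(parts, `[[`, "train"))
table(y[train_id])  # NO: 12, YES: 48 -- same 1:4 ratio as the full data
```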

train_test_split <- initial_validation_split(df, prop = c(0.60, 0.2), strata = LUNG_CANCER)

train <- training(train_test_split)
test <- testing(train_test_split)
val <- validation(train_test_split)

Now, using tidymodels, let's build a workflow:

rec <- recipe(LUNG_CANCER ~ ., data = train)

rf_mod <- rand_forest(mode = "classification", trees = 2000)

rf_flow <- workflow() %>%
    add_recipe(rec) %>%
    add_model(rf_mod)

rf_fit <- fit(rf_flow, train)

Our naive fit:

classification_metrics <- metric_set(accuracy, f_meas)

rf_fit %>%
    predict(val) %>%
    bind_cols(., actual = as.factor(val$LUNG_CANCER)) %>%
    classification_metrics(truth = actual, estimate = .pred_class)
# A tibble: 2 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.935
2 f_meas   binary         0.6  

So on the validation set, we had high accuracy but a low F-score. (By default, yardstick treats the first factor level, NO, as the event class, so the F-score here tracks the minority class.) Let's see why that is:

rf_fit %>%
    predict(val) %>%
    bind_cols(., actual = as.factor(val$LUNG_CANCER)) %>%
    conf_mat(truth = actual, estimate = .pred_class)
          Truth
Prediction NO YES
       NO   3   1
       YES  3  55
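We can reproduce those metrics by hand from the confusion matrix above; since yardstick's default event level is the first factor level, everything is computed with NO as the positive class:

```r
# Validation confusion matrix, with NO as the event class:
#            Truth
# Prediction NO YES
#        NO   3   1
#        YES  3  55
tp <- 3   # predicted NO, truly NO
fp <- 1   # predicted NO, truly YES
fn <- 3   # predicted YES, truly NO
tn <- 55  # predicted YES, truly YES

accuracy  <- (tp + tn) / (tp + fp + fn + tn)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)
round(c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1), 3)
# accuracy 0.935, precision 0.75, recall 0.5, f1 0.6 -- matching yardstick
```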

So even in this small validation set, we lean towards predicting that someone without cancer has it, i.e. a false positive with respect to a cancer diagnosis. Let's see if we can remedy this at all by tuning some of the random forest parameters:

Hyperparameter Tuning

tune_spec <- rand_forest(
  mtry = tune(),
  trees = 1000,
  min_n = tune()
  ) %>%
  set_mode("classification") %>%
  set_engine("ranger")
folds <- vfold_cv(train)
tune_rf_wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(tune_spec)

set.seed(12345)
tune_res <- tune_grid(
  tune_rf_wf,
  resamples = folds,
  grid = 20,
  metrics = classification_metrics
)
i Creating pre-processing data to finalize unknown parameter: mtry
→ A | warning: While computing binary `precision()`, no predicted events were detected (i.e.
               `true_positive + false_positive = 0`).
               Precision is undefined in this case, and `NA` will be returned.
               Note that 4 true event(s) actually occurred for the problematic event level, NO
→ B | warning: (as A, with 1 true event)
→ C | warning: (as A, with 2 true events)
→ D | warning: (as A, with 3 true events)
There were issues with some computations   A: x2   B: x4   C: x1   D: x3

Now let’s look at how the classification metrics over these tuned parameters:

tune_res %>%
    collect_metrics() %>%
    filter(.metric == "f_meas") %>%
    arrange(-mean)
# A tibble: 20 × 8
    mtry min_n .metric .estimator    mean     n std_err .config              
   <int> <int> <chr>   <chr>        <dbl> <int>   <dbl> <chr>                
 1     5    16 f_meas  binary       0.410     7   0.117 Preprocessor1_Model08
 2     4     4 f_meas  binary       0.410     7   0.117 Preprocessor1_Model10
 3    10    13 f_meas  binary       0.396     8   0.125 Preprocessor1_Model05
 4     7     9 f_meas  binary       0.396     8   0.125 Preprocessor1_Model06
 5    14    11 f_meas  binary       0.396     8   0.125 Preprocessor1_Model11
 6    12     6 f_meas  binary       0.396     8   0.125 Preprocessor1_Model18
 7     6    21 f_meas  binary       0.371     7   0.108 Preprocessor1_Model01
 8     7    21 f_meas  binary       0.371     7   0.108 Preprocessor1_Model07
 9     8    27 f_meas  binary       0.367     6   0.128 Preprocessor1_Model19
10     8    14 f_meas  binary       0.358     8   0.113 Preprocessor1_Model14
11     9    18 f_meas  binary       0.358     8   0.113 Preprocessor1_Model15
12    14    32 f_meas  binary       0.327     5   0.153 Preprocessor1_Model02
13    10    26 f_meas  binary       0.314     7   0.120 Preprocessor1_Model03
14    13    36 f_meas  binary       0.167     3   0.167 Preprocessor1_Model13
15    13    35 f_meas  binary       0.167     3   0.167 Preprocessor1_Model17
16     3    29 f_meas  binary       0         1  NA     Preprocessor1_Model04
17    11    38 f_meas  binary       0         2   0     Preprocessor1_Model12
18     1     7 f_meas  binary       0         1  NA     Preprocessor1_Model16
19     2    24 f_meas  binary       0         1  NA     Preprocessor1_Model20
20     3    34 f_meas  binary     NaN         0  NA     Preprocessor1_Model09

The F score hasn’t improved by resampling, so instead, let’s over sample the minority case in our recipe:

SMOTE for Imbalance

rec_ov_sampled <- rec %>%
    step_dummy(GENDER) %>%
    step_smote(LUNG_CANCER)
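For context, SMOTE synthesizes new minority-class rows by interpolating between a minority sample and one of its nearest minority-class neighbors. A minimal base-R sketch of that interpolation step (illustrative only; themis's step_smote wraps a full k-nearest-neighbors implementation):

```r
set.seed(1)
# Toy minority-class feature matrix (rows = samples)
minority <- matrix(c(1, 2,
                     2, 1,
                     3, 3), ncol = 2, byrow = TRUE)

# Interpolate a random fraction of the way toward a random neighbor
smote_point <- function(x, neighbors) {
  nb <- neighbors[sample(nrow(neighbors), 1), ]
  x + runif(1) * (nb - x)
}

# Synthesize one new point from the first minority sample
new_row <- smote_point(minority[1, ], minority[-1, , drop = FALSE])
new_row  # lies on the segment between the sample and a neighbor
```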

Let’s see how this affects the data:

rec %>%
    prep() %>%
    juice() %>%
    group_by(LUNG_CANCER) %>%
    skim()
Data summary
Name Piped data
Number of rows 185
Number of columns 16
_______________________
Column type frequency:
factor 1
numeric 14
________________________
Group variables LUNG_CANCER

Variable type: factor

skim_variable LUNG_CANCER n_missing complete_rate ordered n_unique top_counts
GENDER NO 0 1 FALSE 2 F: 13, M: 10
GENDER YES 0 1 FALSE 2 M: 87, F: 75

Variable type: numeric

skim_variable LUNG_CANCER n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
AGE NO 0 1 61.43 11.82 21 57 63 68.0 87 ▁▁▅▇▁
AGE YES 0 1 62.80 8.61 38 58 62 70.0 81 ▁▃▇▆▃
SMOKING NO 0 1 1.48 0.51 1 1 1 2.0 2 ▇▁▁▁▇
SMOKING YES 0 1 1.61 0.49 1 1 2 2.0 2 ▅▁▁▁▇
YELLOW_FINGERS NO 0 1 1.30 0.47 1 1 1 2.0 2 ▇▁▁▁▃
YELLOW_FINGERS YES 0 1 1.60 0.49 1 1 2 2.0 2 ▅▁▁▁▇
ANXIETY NO 0 1 1.22 0.42 1 1 1 1.0 2 ▇▁▁▁▂
ANXIETY YES 0 1 1.51 0.50 1 1 2 2.0 2 ▇▁▁▁▇
PEER_PRESSURE NO 0 1 1.17 0.39 1 1 1 1.0 2 ▇▁▁▁▂
PEER_PRESSURE YES 0 1 1.54 0.50 1 1 2 2.0 2 ▇▁▁▁▇
CHRONIC DISEASE NO 0 1 1.39 0.50 1 1 1 2.0 2 ▇▁▁▁▅
CHRONIC DISEASE YES 0 1 1.55 0.50 1 1 2 2.0 2 ▆▁▁▁▇
FATIGUE NO 0 1 1.52 0.51 1 1 2 2.0 2 ▇▁▁▁▇
FATIGUE YES 0 1 1.70 0.46 1 1 2 2.0 2 ▃▁▁▁▇
ALLERGY NO 0 1 1.17 0.39 1 1 1 1.0 2 ▇▁▁▁▂
ALLERGY YES 0 1 1.62 0.49 1 1 2 2.0 2 ▅▁▁▁▇
WHEEZING NO 0 1 1.26 0.45 1 1 1 1.5 2 ▇▁▁▁▃
WHEEZING YES 0 1 1.60 0.49 1 1 2 2.0 2 ▅▁▁▁▇
ALCOHOL CONSUMING NO 0 1 1.22 0.42 1 1 1 1.0 2 ▇▁▁▁▂
ALCOHOL CONSUMING YES 0 1 1.60 0.49 1 1 2 2.0 2 ▅▁▁▁▇
COUGHING NO 0 1 1.35 0.49 1 1 1 2.0 2 ▇▁▁▁▅
COUGHING YES 0 1 1.62 0.49 1 1 2 2.0 2 ▅▁▁▁▇
SHORTNESS OF BREATH NO 0 1 1.61 0.50 1 1 2 2.0 2 ▅▁▁▁▇
SHORTNESS OF BREATH YES 0 1 1.68 0.47 1 1 2 2.0 2 ▃▁▁▁▇
SWALLOWING DIFFICULTY NO 0 1 1.09 0.29 1 1 1 1.0 2 ▇▁▁▁▁
SWALLOWING DIFFICULTY YES 0 1 1.57 0.50 1 1 2 2.0 2 ▆▁▁▁▇
CHEST PAIN NO 0 1 1.30 0.47 1 1 1 2.0 2 ▇▁▁▁▃
CHEST PAIN YES 0 1 1.62 0.49 1 1 2 2.0 2 ▅▁▁▁▇
rec %>%
    prep() %>%
    juice() %>%
    ggplot(aes(x=LUNG_CANCER)) +
    geom_bar() +
    labs(title = "Lung Cancer diagnoses before SMOTE")

rec_ov_sampled %>%
    prep() %>%
    juice() %>%
    group_by(LUNG_CANCER) %>%
    skim()
Data summary
Name Piped data
Number of rows 324
Number of columns 16
_______________________
Column type frequency:
numeric 15
________________________
Group variables LUNG_CANCER

Variable type: numeric

skim_variable LUNG_CANCER n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
AGE NO 0 1 62.12 7.94 21 56.96 63.00 68.25 87 ▁▁▅▇▁
AGE YES 0 1 62.80 8.61 38 58.00 62.00 70.00 81 ▁▃▇▆▃
SMOKING NO 0 1 1.51 0.41 1 1.00 1.55 1.97 2 ▇▂▂▂▇
SMOKING YES 0 1 1.61 0.49 1 1.00 2.00 2.00 2 ▅▁▁▁▇
YELLOW_FINGERS NO 0 1 1.32 0.41 1 1.00 1.00 1.78 2 ▇▁▁▁▃
YELLOW_FINGERS YES 0 1 1.60 0.49 1 1.00 2.00 2.00 2 ▅▁▁▁▇
ANXIETY NO 0 1 1.24 0.35 1 1.00 1.00 1.52 2 ▇▁▁▁▂
ANXIETY YES 0 1 1.51 0.50 1 1.00 2.00 2.00 2 ▇▁▁▁▇
PEER_PRESSURE NO 0 1 1.20 0.34 1 1.00 1.00 1.34 2 ▇▁▁▁▁
PEER_PRESSURE YES 0 1 1.54 0.50 1 1.00 2.00 2.00 2 ▇▁▁▁▇
CHRONIC DISEASE NO 0 1 1.32 0.39 1 1.00 1.05 1.67 2 ▇▂▁▁▃
CHRONIC DISEASE YES 0 1 1.55 0.50 1 1.00 2.00 2.00 2 ▆▁▁▁▇
FATIGUE NO 0 1 1.46 0.42 1 1.00 1.37 2.00 2 ▇▂▂▂▆
FATIGUE YES 0 1 1.70 0.46 1 1.00 2.00 2.00 2 ▃▁▁▁▇
ALLERGY NO 0 1 1.17 0.31 1 1.00 1.00 1.22 2 ▇▁▁▁▁
ALLERGY YES 0 1 1.62 0.49 1 1.00 2.00 2.00 2 ▅▁▁▁▇
WHEEZING NO 0 1 1.27 0.38 1 1.00 1.00 1.56 2 ▇▁▁▁▂
WHEEZING YES 0 1 1.60 0.49 1 1.00 2.00 2.00 2 ▅▁▁▁▇
ALCOHOL CONSUMING NO 0 1 1.22 0.34 1 1.00 1.00 1.42 2 ▇▁▁▁▁
ALCOHOL CONSUMING YES 0 1 1.60 0.49 1 1.00 2.00 2.00 2 ▅▁▁▁▇
COUGHING NO 0 1 1.37 0.40 1 1.00 1.18 1.77 2 ▇▂▁▂▃
COUGHING YES 0 1 1.62 0.49 1 1.00 2.00 2.00 2 ▅▁▁▁▇
SHORTNESS OF BREATH NO 0 1 1.56 0.41 1 1.14 1.58 2.00 2 ▅▂▂▁▇
SHORTNESS OF BREATH YES 0 1 1.68 0.47 1 1.00 2.00 2.00 2 ▃▁▁▁▇
SWALLOWING DIFFICULTY NO 0 1 1.10 0.26 1 1.00 1.00 1.00 2 ▇▁▁▁▁
SWALLOWING DIFFICULTY YES 0 1 1.57 0.50 1 1.00 2.00 2.00 2 ▆▁▁▁▇
CHEST PAIN NO 0 1 1.34 0.39 1 1.00 1.12 1.75 2 ▇▁▁▂▃
CHEST PAIN YES 0 1 1.62 0.49 1 1.00 2.00 2.00 2 ▅▁▁▁▇
GENDER_M NO 0 1 0.44 0.42 0 0.00 0.37 0.93 1 ▇▁▂▂▅
GENDER_M YES 0 1 0.54 0.50 0 0.00 1.00 1.00 1 ▇▁▁▁▇
rec_ov_sampled %>%
    prep() %>%
    juice() %>%
    ggplot(aes(x=LUNG_CANCER)) +
    geom_bar() +
    labs(title = "Lung Cancer diagnoses after SMOTE")

Let’s try the model out with this recipe instead…

rf_smote_flow <- workflow() %>%
    add_recipe(rec_ov_sampled) %>%
    add_model(rf_mod)

rf_fit <- fit(rf_smote_flow, train)
rf_fit %>%
    predict(val) %>%
    bind_cols(., actual = as.factor(val$LUNG_CANCER)) %>%
    conf_mat(truth = actual, estimate = .pred_class)
          Truth
Prediction NO YES
       NO   3   1
       YES  3  55

In this case, we’ve squeezed one better prediction out, and importantly it was indeed a false negative that we converted to a true positive.

I believe it’s probably time to try a different model — maybe RF isn’t best suited for this task.

SVM

Support vector machines are also good classifiers. Let’s try that out.
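For intuition, a linear SVM classifies with the sign of a linear decision function f(x) = w·x + b, where the cost parameter controls how heavily margin violations are penalized during fitting. A toy illustration of the decision rule (the weights here are made up, not fitted):

```r
# Hypothetical weights and intercept for two features
w <- c(0.8, -0.5)
b <- -0.2

svm_predict <- function(x) ifelse(drop(x %*% w) + b >= 0, "YES", "NO")

svm_predict(rbind(c(1.0, 0.2),   # f(x) = 0.8 - 0.1 - 0.2 =  0.5
                  c(0.1, 0.9)))  # f(x) = 0.08 - 0.45 - 0.2 = -0.57
# [1] "YES" "NO"
```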

svm_mod <- svm_linear(
  cost = double(1),
  margin = double(1)
) %>%  
  set_mode("classification")
svm_smote_flow <- workflow() %>%
    add_recipe(rec_ov_sampled) %>%
    add_model(svm_mod)

svm_fit <- fit(svm_smote_flow, train)
svm_fit %>% 
    predict(val) %>%
    bind_cols(., actual = as.factor(val$LUNG_CANCER)) %>%
    conf_mat(truth = actual, estimate = .pred_class)
          Truth
Prediction NO YES
       NO   6  26
       YES  0  30

Interestingly, this model produces far more false negatives, the worst kind of error in cancer screening! Let's try to tune it (note that in parsnip, `margin` is documented as the epsilon of the insensitive loss for SVM regression only, so tuning it here may have no effect):

tune_spec <- svm_linear(
  cost = tune(),
  margin = tune()
) %>%  
  set_mode("classification")

tune_svm_wf <- workflow() %>%
  add_recipe(rec_ov_sampled) %>%
  add_model(tune_spec)

tune_res <- tune_grid(
  tune_svm_wf,
  resamples = folds,
  grid = 50,
  metrics = classification_metrics
)
→ A | warning: While computing binary `precision()`, no predicted events were detected (i.e.
               `true_positive + false_positive = 0`).
               Precision is undefined in this case, and `NA` will be returned.
               Note that 4 true event(s) actually occurred for the problematic event level, NO
→ B | warning: (as A, with 1 true event)
→ C | warning: (as A, with 2 true events)
→ D | warning: (as A, with 3 true events)
There were issues with some computations   A: x2   B: x4   C: x1   D: x3
tune_res %>%
    collect_metrics() %>%
    filter(.metric == "f_meas") %>%
    arrange(-mean)
# A tibble: 50 × 8
       cost  margin .metric .estimator  mean     n std_err .config              
      <dbl>   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
 1  0.0595  0.0157  f_meas  binary     0.435     8  0.110  Preprocessor1_Model32
 2  0.932   0.112   f_meas  binary     0.408     7  0.102  Preprocessor1_Model19
 3 24.6     0.194   f_meas  binary     0.388    10  0.102  Preprocessor1_Model16
 4  0.641   0.0502  f_meas  binary     0.388     9  0.0989 Preprocessor1_Model14
 5  0.00602 0.145   f_meas  binary     0.374     9  0.0959 Preprocessor1_Model33
 6  0.104   0.152   f_meas  binary     0.362     7  0.0489 Preprocessor1_Model36
 7  6.29    0.0538  f_meas  binary     0.347    10  0.0891 Preprocessor1_Model03
 8  0.199   0.179   f_meas  binary     0.344     9  0.0969 Preprocessor1_Model18
 9  0.00434 0.00752 f_meas  binary     0.343     8  0.0981 Preprocessor1_Model17
10  2.40    0.00953 f_meas  binary     0.343     8  0.0977 Preprocessor1_Model37
# ℹ 40 more rows

It looks like there is an improvement in F score over the random forest, so I’ll stick with this model for now.

Let’s try these best parameters:

best <- tune_res %>% select_best(metric = "f_meas")
svm_best <- svm_linear(
  cost = best$cost,
  margin = best$margin
) %>%  
  set_mode("classification")

svm_smote_flow <- workflow() %>%
    add_recipe(rec_ov_sampled) %>%
    add_model(svm_best)

svm_fit <- fit(svm_smote_flow, train)
svm_fit %>% 
    predict(val) %>%
    bind_cols(., actual = as.factor(val$LUNG_CANCER)) %>%
    classification_metrics(truth = actual, estimate = .pred_class)
# A tibble: 2 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.919
2 f_meas   binary         0.444

Not much improvement!

Another Model: Decision Tree rpart

I’ve had some success with this algorithm in the past, let’s see if it works here:

rpart_mod <- decision_tree(
    mode = "classification"
)
rpart_flow <- workflow() %>%
    add_recipe(rec_ov_sampled) %>%
    add_model(rpart_mod)

rpart_fit <- fit(rpart_flow, train)
rpart_fit %>% 
    predict(val) %>%
    bind_cols(., actual = as.factor(val$LUNG_CANCER)) %>%
    conf_mat(truth = actual, estimate = .pred_class)
          Truth
Prediction NO YES
       NO   6  12
       YES  0  44

Not too much better, to be honest. For the sake of time, let’s settle on our best, the tuned SVM with a margin of 0.0157203 and cost of 0.0595231.

svm_fit %>%
    predict(test) %>%
    bind_cols(., actual = as.factor(test$LUNG_CANCER)) %>%
    conf_mat(truth = actual, estimate = .pred_class)
          Truth
Prediction NO YES
       NO   1   2
       YES  5  54
svm_fit %>%
    predict(test) %>%
    bind_cols(., actual = as.factor(test$LUNG_CANCER)) %>%
    classification_metrics(truth = actual, estimate = .pred_class)
# A tibble: 2 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.887
2 f_meas   binary         0.222
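To see why the F-score collapsed, we can decompose it by hand from the test-set confusion matrix above (again with NO as the event level):

```r
# Test-set confusion matrix, NO as the event class:
#            Truth
# Prediction NO YES
#        NO   1   2
#        YES  5  54
tp <- 1; fp <- 2; fn <- 5; tn <- 54

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)
round(c(precision = precision, recall = recall, f1 = f1), 3)
# precision 0.333, recall 0.167, f1 0.222 -- the low recall on NO drives the F-score down
```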

Conclusion

One of the most important aspects of disease prediction, especially for a disease like cancer, is that your models should ultimately be attempting to reduce the most harmful errors. In this case, false negatives would be devastating, as a missed cancer is far more costly than a false alarm. In this experiment, I found the F-measure to be a good metric, as it lets us measure not just accuracy but also, through its recall component, the likelihood of making such errors.

The final model underperformed in this respect, but I'm sure that, given more time, there is a model and parameter set that would reduce false negatives.