Simple Lung Cancer Classifier

Introduction

Machine learning is an effective tool in oncology. By training models on simple phenotypic characteristics, it's possible to build powerful classifiers for predicting whether an individual may have cancer. Such classifiers can serve as early-detection mechanisms: from a small set of phenotypic descriptions, doctors can quickly estimate the probability that a cancer is the root cause of someone's illness, saving time in triage.

Let’s build a quick classifier for lung cancer based on an example Kaggle dataset.

suppressPackageStartupMessages(
    {
    library(tidyverse)
    library(tidymodels)
    library(skimr)
    library(themis)
    }
)

Let’s look at the data:

url <- "https://storage.googleapis.com/kagglesdsdata/datasets/1623385/2668247/survey%20lung%20cancer.csv?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20240506%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20240506T191835Z&X-Goog-Expires=259200&X-Goog-SignedHeaders=host&X-Goog-Signature=2d244357131a33758f10bf7f3a83f5584f02ba69ab6ef5e31cb5f7c9cc3ea73922db7cb03bb47b5345455b27f4c25f4434be415327d7817b954933fe86aefc50b07fb4e6a0736238366f4a922a91615222be63de02ab687a9759eace7c59e62468deb656a3abafea85a93f0fd77fac708c2d37bece32da9669b5add662affdcfa40c73b9a8413d9039f207492544aba4ddd2f13baaa10a4b92327ea52db9322797db7113eeb00deb5ed073ddde2c58915c7b94710a2bec1dc03e4950fb20234837f03b0d4a7d5bbbbcde8c98eb2c1e7355acd3c179adf9f66d626c27ee70b46828f7d1e0b8222e446296fc3d813b8dfd42219bc6887266bf6bcc9c2ae2750aaa"
df <- read_csv(url)
Rows: 309 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): GENDER, LUNG_CANCER
dbl (14): AGE, SMOKING, YELLOW_FINGERS, ANXIETY, PEER_PRESSURE, CHRONIC DISE...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skim(df)
Data summary
Name df
Number of rows 309
Number of columns 16
_______________________
Column type frequency:
character 2
numeric 14
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
GENDER 0 1 1 1 0 2 0
LUNG_CANCER 0 1 2 3 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
AGE 0 1 62.67 8.21 21 57 62 69 87 ▁▁▆▇▂
SMOKING 0 1 1.56 0.50 1 1 2 2 2 ▆▁▁▁▇
YELLOW_FINGERS 0 1 1.57 0.50 1 1 2 2 2 ▆▁▁▁▇
ANXIETY 0 1 1.50 0.50 1 1 1 2 2 ▇▁▁▁▇
PEER_PRESSURE 0 1 1.50 0.50 1 1 2 2 2 ▇▁▁▁▇
CHRONIC DISEASE 0 1 1.50 0.50 1 1 2 2 2 ▇▁▁▁▇
FATIGUE 0 1 1.67 0.47 1 1 2 2 2 ▃▁▁▁▇
ALLERGY 0 1 1.56 0.50 1 1 2 2 2 ▆▁▁▁▇
WHEEZING 0 1 1.56 0.50 1 1 2 2 2 ▆▁▁▁▇
ALCOHOL CONSUMING 0 1 1.56 0.50 1 1 2 2 2 ▆▁▁▁▇
COUGHING 0 1 1.58 0.49 1 1 2 2 2 ▆▁▁▁▇
SHORTNESS OF BREATH 0 1 1.64 0.48 1 1 2 2 2 ▅▁▁▁▇
SWALLOWING DIFFICULTY 0 1 1.47 0.50 1 1 1 2 2 ▇▁▁▁▇
CHEST PAIN 0 1 1.56 0.50 1 1 2 2 2 ▆▁▁▁▇

From this, we understand that almost all of the variables are binary yes/no questions encoded as 1/2, with the exception of age. Let's see the class distribution:

df %>%
    ggplot(aes(x=LUNG_CANCER)) +
    geom_bar()

This is clearly imbalanced data: there are far more positive cases of lung cancer than negative ones. We'll keep this in mind for later. For now, let's see if there are any noticeable distribution differences between the two classes:

df %>%
    group_by(LUNG_CANCER) %>%
    skim()
Data summary
Name Piped data
Number of rows 309
Number of columns 16
_______________________
Column type frequency:
character 1
numeric 14
________________________
Group variables LUNG_CANCER

Variable type: character

skim_variable LUNG_CANCER n_missing complete_rate min max empty n_unique whitespace
GENDER NO 0 1 1 1 0 2 0
GENDER YES 0 1 1 1 0 2 0

Variable type: numeric

skim_variable LUNG_CANCER n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
AGE NO 0 1 60.74 9.63 21 57 61.0 65.5 87 ▁▁▇▇▁
AGE YES 0 1 62.95 7.97 38 58 62.5 69.0 81 ▁▃▇▇▂
SMOKING NO 0 1 1.49 0.51 1 1 1.0 2.0 2 ▇▁▁▁▇
SMOKING YES 0 1 1.57 0.50 1 1 2.0 2.0 2 ▆▁▁▁▇
YELLOW_FINGERS NO 0 1 1.33 0.48 1 1 1.0 2.0 2 ▇▁▁▁▃
YELLOW_FINGERS YES 0 1 1.60 0.49 1 1 2.0 2.0 2 ▅▁▁▁▇
ANXIETY NO 0 1 1.31 0.47 1 1 1.0 2.0 2 ▇▁▁▁▃
ANXIETY YES 0 1 1.53 0.50 1 1 2.0 2.0 2 ▇▁▁▁▇
PEER_PRESSURE NO 0 1 1.26 0.44 1 1 1.0 1.5 2 ▇▁▁▁▃
PEER_PRESSURE YES 0 1 1.54 0.50 1 1 2.0 2.0 2 ▇▁▁▁▇
CHRONIC DISEASE NO 0 1 1.36 0.49 1 1 1.0 2.0 2 ▇▁▁▁▅
CHRONIC DISEASE YES 0 1 1.53 0.50 1 1 2.0 2.0 2 ▇▁▁▁▇
FATIGUE NO 0 1 1.49 0.51 1 1 1.0 2.0 2 ▇▁▁▁▇
FATIGUE YES 0 1 1.70 0.46 1 1 2.0 2.0 2 ▃▁▁▁▇
ALLERGY NO 0 1 1.13 0.34 1 1 1.0 1.0 2 ▇▁▁▁▁
ALLERGY YES 0 1 1.62 0.49 1 1 2.0 2.0 2 ▅▁▁▁▇
WHEEZING NO 0 1 1.23 0.43 1 1 1.0 1.0 2 ▇▁▁▁▂
WHEEZING YES 0 1 1.60 0.49 1 1 2.0 2.0 2 ▅▁▁▁▇
ALCOHOL CONSUMING NO 0 1 1.18 0.39 1 1 1.0 1.0 2 ▇▁▁▁▂
ALCOHOL CONSUMING YES 0 1 1.61 0.49 1 1 2.0 2.0 2 ▅▁▁▁▇
COUGHING NO 0 1 1.26 0.44 1 1 1.0 1.5 2 ▇▁▁▁▃
COUGHING YES 0 1 1.63 0.48 1 1 2.0 2.0 2 ▅▁▁▁▇
SHORTNESS OF BREATH NO 0 1 1.56 0.50 1 1 2.0 2.0 2 ▆▁▁▁▇
SHORTNESS OF BREATH YES 0 1 1.65 0.48 1 1 2.0 2.0 2 ▅▁▁▁▇
SWALLOWING DIFFICULTY NO 0 1 1.13 0.34 1 1 1.0 1.0 2 ▇▁▁▁▁
SWALLOWING DIFFICULTY YES 0 1 1.52 0.50 1 1 2.0 2.0 2 ▇▁▁▁▇
CHEST PAIN NO 0 1 1.31 0.47 1 1 1.0 2.0 2 ▇▁▁▁▃
CHEST PAIN YES 0 1 1.59 0.49 1 1 2.0 2.0 2 ▆▁▁▁▇

It looks like fatigue, yellow fingers, allergies, alcohol consumption, and swallowing difficulty are likely to be particularly discriminant in differentiating the diagnoses, because their histograms show a noticeable imbalance between the two classes.

First Model

Because this dataset consists mostly of binary features, a decision-tree-based model is a natural fit. Additionally, we're going to hold out a separate validation set for model selection so that we don't leak information from the test set.
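The split below uses rsample's initial_validation_split(); the underlying idea, splitting within each class so the class ratio is preserved, can be sketched in base R (an illustrative toy, not rsample's implementation):

```r
set.seed(42)
y <- rep(c("YES", "NO"), times = c(80, 20))  # imbalanced toy labels

# 60/20/20 split of the indices within one class
split_idx <- function(idx, prop = c(0.6, 0.2)) {
  idx  <- sample(idx)                    # shuffle within the class
  cuts <- round(cumsum(prop) * length(idx))
  list(train = idx[seq_len(cuts[1])],
       val   = idx[(cuts[1] + 1):cuts[2]],
       test  = idx[(cuts[2] + 1):length(idx)])
}

# Apply the split per class, then pool the training indices
parts    <- lapply(split(seq_along(y), y), split_idx)
train_id <- unlist(lapply(parts, `[[`, "train"))
table(y[train_id])  # NO: 12, YES: 48 -- same 1:4 ratio as the full data
```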

train_test_split <- initial_validation_split(df, prop = c(0.60, 0.2), strata = LUNG_CANCER)

train <- training(train_test_split)
test <- testing(train_test_split)
val <- validation(train_test_split)

Now, using tidymodels, let's build a workflow:

rec <- recipe(LUNG_CANCER ~ ., data = train)

rf_mod <- rand_forest(mode = "classification", trees = 2000)

rf_flow <- workflow() %>%
    add_recipe(rec) %>%
    add_model(rf_mod)

rf_fit <- fit(rf_flow, train)

Our naive fit:

classification_metrics <- metric_set(accuracy, f_meas)

rf_fit %>%
    predict(val) %>%
    bind_cols(., actual = as.factor(val$LUNG_CANCER)) %>%
    classification_metrics(truth = actual, estimate = .pred_class)
# A tibble: 2 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.935
2 f_meas   binary         0.6  

So on the validation set, we had high accuracy but a low F-score. (By default, yardstick treats the first factor level, NO, as the event class, so the F-score here tracks the minority class.) Let's see why that is:

rf_fit %>%
    predict(val) %>%
    bind_cols(., actual = as.factor(val$LUNG_CANCER)) %>%
    conf_mat(truth = actual, estimate = .pred_class)
          Truth
Prediction NO YES
       NO   3   1
       YES  3  55
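We can reproduce those metrics by hand from the confusion matrix above; since yardstick's default event level is the first factor level, everything is computed with NO as the positive class:

```r
# Validation confusion matrix, with NO as the event class:
#            Truth
# Prediction NO YES
#        NO   3   1
#        YES  3  55
tp <- 3   # predicted NO, truly NO
fp <- 1   # predicted NO, truly YES
fn <- 3   # predicted YES, truly NO
tn <- 55  # predicted YES, truly YES

accuracy  <- (tp + tn) / (tp + fp + fn + tn)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)
round(c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1), 3)
# accuracy 0.935, precision 0.75, recall 0.5, f1 0.6 -- matching yardstick
```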

So even in this small validation set, we lean towards predicting that someone without cancer has it, i.e. a false positive with respect to a cancer diagnosis. Let's see if we can remedy this at all by tuning some of the random forest parameters:

Hyperparameter Tuning

tune_spec <- rand_forest(
  mtry = tune(),
  trees = 1000,
  min_n = tune()
  ) %>%
  set_mode("classification") %>%
  set_engine("ranger")
folds <- vfold_cv(train)
tune_rf_wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(tune_spec)

set.seed(12345)
tune_res <- tune_grid(
  tune_rf_wf,
  resamples = folds,
  grid = 20,
  metrics = classification_metrics
)
i Creating pre-processing data to finalize unknown parameter: mtry
→ A | warning: While computing binary `precision()`, no predicted events were detected (i.e.
               `true_positive + false_positive = 0`).
               Precision is undefined in this case, and `NA` will be returned.
               Note that 4 true event(s) actually occurred for the problematic event level, NO
→ B | warning: (as A, with 1 true event)
→ C | warning: (as A, with 2 true events)
→ D | warning: (as A, with 3 true events)
There were issues with some computations   A: x2   B: x4   C: x1   D: x3

Now let’s look at how the classification metrics over these tuned parameters:

tune_res %>%
    collect_metrics() %>%
    filter(.metric == "f_meas") %>%
    arrange(-mean)
# A tibble: 20 × 8
    mtry min_n .metric .estimator    mean     n std_err .config              
   <int> <int> <chr>   <chr>        <dbl> <int>   <dbl> <chr>                
 1     5    16 f_meas  binary       0.410     7   0.117 Preprocessor1_Model08
 2     4     4 f_meas  binary       0.410     7   0.117 Preprocessor1_Model10
 3    10    13 f_meas  binary       0.396     8   0.125 Preprocessor1_Model05
 4     7     9 f_meas  binary       0.396     8   0.125 Preprocessor1_Model06
 5    14    11 f_meas  binary       0.396     8   0.125 Preprocessor1_Model11
 6    12     6 f_meas  binary       0.396     8   0.125 Preprocessor1_Model18
 7     6    21 f_meas  binary       0.371     7   0.108 Preprocessor1_Model01
 8     7    21 f_meas  binary       0.371     7   0.108 Preprocessor1_Model07
 9     8    27 f_meas  binary       0.367     6   0.128 Preprocessor1_Model19
10     8    14 f_meas  binary       0.358     8   0.113 Preprocessor1_Model14
11     9    18 f_meas  binary       0.358     8   0.113 Preprocessor1_Model15
12    14    32 f_meas  binary       0.327     5   0.153 Preprocessor1_Model02
13    10    26 f_meas  binary       0.314     7   0.120 Preprocessor1_Model03
14    13    36 f_meas  binary       0.167     3   0.167 Preprocessor1_Model13
15    13    35 f_meas  binary       0.167     3   0.167 Preprocessor1_Model17
16     3    29 f_meas  binary       0         1  NA     Preprocessor1_Model04
17    11    38 f_meas  binary       0         2   0     Preprocessor1_Model12
18     1     7 f_meas  binary       0         1  NA     Preprocessor1_Model16
19     2    24 f_meas  binary       0         1  NA     Preprocessor1_Model20
20     3    34 f_meas  binary     NaN         0  NA     Preprocessor1_Model09

The F score hasn’t improved by resampling, so instead, let’s over sample the minority case in our recipe:

SMOTE for Imbalance

rec_ov_sampled <- rec %>%
    step_dummy(GENDER) %>%
    step_smote(LUNG_CANCER)
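For context, SMOTE synthesizes new minority-class rows by interpolating between a minority sample and one of its nearest minority-class neighbors. A minimal base-R sketch of that interpolation step (illustrative only; themis's step_smote wraps a full k-nearest-neighbors implementation):

```r
set.seed(1)
# Toy minority-class feature matrix (rows = samples)
minority <- matrix(c(1, 2,
                     2, 1,
                     3, 3), ncol = 2, byrow = TRUE)

# Interpolate a random fraction of the way toward a random neighbor
smote_point <- function(x, neighbors) {
  nb <- neighbors[sample(nrow(neighbors), 1), ]
  x + runif(1) * (nb - x)
}

# Synthesize one new point from the first minority sample
new_row <- smote_point(minority[1, ], minority[-1, , drop = FALSE])
new_row  # lies on the segment between the sample and a neighbor
```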

Let’s see how this affects the data:

rec %>%
    prep() %>%
    juice() %>%
    group_by(LUNG_CANCER) %>%
    skim()
Data summary
Name Piped data
Number of rows 185
Number of columns 16
_______________________
Column type frequency:
factor 1
numeric 14
________________________
Group variables LUNG_CANCER

Variable type: factor

skim_variable LUNG_CANCER n_missing complete_rate ordered n_unique top_counts
GENDER NO 0 1 FALSE 2 F: 13, M: 10
GENDER YES 0 1 FALSE 2 M: 87, F: 75

Variable type: numeric

skim_variable LUNG_CANCER n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
AGE NO 0 1 61.43 11.82 21 57 63 68.0 87 ▁▁▅▇▁
AGE YES 0 1 62.80 8.61 38 58 62 70.0 81 ▁▃▇▆▃
SMOKING NO 0 1 1.48 0.51 1 1 1 2.0 2 ▇▁▁▁▇
SMOKING YES 0 1 1.61 0.49 1 1 2 2.0 2 ▅▁▁▁▇
YELLOW_FINGERS NO 0 1 1.30 0.47 1 1 1 2.0 2 ▇▁▁▁▃
YELLOW_FINGERS YES 0 1 1.60 0.49 1 1 2 2.0 2 ▅▁▁▁▇
ANXIETY NO 0 1 1.22 0.42 1 1 1 1.0 2 ▇▁▁▁▂
ANXIETY YES 0 1 1.51 0.50 1 1 2 2.0 2 ▇▁▁▁▇
PEER_PRESSURE NO 0 1 1.17 0.39 1 1 1 1.0 2 ▇▁▁▁▂
PEER_PRESSURE YES 0 1 1.54 0.50 1 1 2 2.0 2 ▇▁▁▁▇
CHRONIC DISEASE NO 0 1 1.39 0.50 1 1 1 2.0 2 ▇▁▁▁▅
CHRONIC DISEASE YES 0 1 1.55 0.50 1 1 2 2.0 2 ▆▁▁▁▇
FATIGUE NO 0 1 1.52 0.51 1 1 2 2.0 2 ▇▁▁▁▇
FATIGUE YES 0 1 1.70 0.46 1 1 2 2.0 2 ▃▁▁▁▇
ALLERGY NO 0 1 1.17 0.39 1 1 1 1.0 2 ▇▁▁▁▂
ALLERGY YES 0 1 1.62 0.49 1 1 2 2.0 2 ▅▁▁▁▇
WHEEZING NO 0 1 1.26 0.45 1 1 1 1.5 2 ▇▁▁▁▃
WHEEZING YES 0 1 1.60 0.49 1 1 2 2.0 2 ▅▁▁▁▇
ALCOHOL CONSUMING NO 0 1 1.22 0.42 1 1 1 1.0 2 ▇▁▁▁▂
ALCOHOL CONSUMING YES 0 1 1.60 0.49 1 1 2 2.0 2 ▅▁▁▁▇
COUGHING NO 0 1 1.35 0.49 1 1 1 2.0 2 ▇▁▁▁▅
COUGHING YES 0 1 1.62 0.49 1 1 2 2.0 2 ▅▁▁▁▇
SHORTNESS OF BREATH NO 0 1 1.61 0.50 1 1 2 2.0 2 ▅▁▁▁▇
SHORTNESS OF BREATH YES 0 1 1.68 0.47 1 1 2 2.0 2 ▃▁▁▁▇
SWALLOWING DIFFICULTY NO 0 1 1.09 0.29 1 1 1 1.0 2 ▇▁▁▁▁
SWALLOWING DIFFICULTY YES 0 1 1.57 0.50 1 1 2 2.0 2 ▆▁▁▁▇
CHEST PAIN NO 0 1 1.30 0.47 1 1 1 2.0 2 ▇▁▁▁▃
CHEST PAIN YES 0 1 1.62 0.49 1 1 2 2.0 2 ▅▁▁▁▇
rec %>%
    prep() %>%
    juice() %>%
    ggplot(aes(x=LUNG_CANCER)) +
    geom_bar() +
    labs(title = "Lung Cancer diagnoses before SMOTE")

rec_ov_sampled %>%
    prep() %>%
    juice() %>%
    group_by(LUNG_CANCER) %>%
    skim()
Data summary
Name Piped data
Number of rows 324
Number of columns 16
_______________________
Column type frequency:
numeric 15
________________________
Group variables LUNG_CANCER

Variable type: numeric

skim_variable LUNG_CANCER n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
AGE NO 0 1 62.12 7.94 21 56.96 63.00 68.25 87 ▁▁▅▇▁
AGE YES 0 1 62.80 8.61 38 58.00 62.00 70.00 81 ▁▃▇▆▃
SMOKING NO 0 1 1.51 0.41 1 1.00 1.55 1.97 2 ▇▂▂▂▇
SMOKING YES 0 1 1.61 0.49 1 1.00 2.00 2.00 2 ▅▁▁▁▇
YELLOW_FINGERS NO 0 1 1.32 0.41 1 1.00 1.00 1.78 2 ▇▁▁▁▃
YELLOW_FINGERS YES 0 1 1.60 0.49 1 1.00 2.00 2.00 2 ▅▁▁▁▇
ANXIETY NO 0 1 1.24 0.35 1 1.00 1.00 1.52 2 ▇▁▁▁▂
ANXIETY YES 0 1 1.51 0.50 1 1.00 2.00 2.00 2 ▇▁▁▁▇
PEER_PRESSURE NO 0 1 1.20 0.34 1 1.00 1.00 1.34 2 ▇▁▁▁▁
PEER_PRESSURE YES 0 1 1.54 0.50 1 1.00 2.00 2.00 2 ▇▁▁▁▇
CHRONIC DISEASE NO 0 1 1.32 0.39 1 1.00 1.05 1.67 2 ▇▂▁▁▃
CHRONIC DISEASE YES 0 1 1.55 0.50 1 1.00 2.00 2.00 2 ▆▁▁▁▇
FATIGUE NO 0 1 1.46 0.42 1 1.00 1.37 2.00 2 ▇▂▂▂▆
FATIGUE YES 0 1 1.70 0.46 1 1.00 2.00 2.00 2 ▃▁▁▁▇
ALLERGY NO 0 1 1.17 0.31 1 1.00 1.00 1.22 2 ▇▁▁▁▁
ALLERGY YES 0 1 1.62 0.49 1 1.00 2.00 2.00 2 ▅▁▁▁▇
WHEEZING NO 0 1 1.27 0.38 1 1.00 1.00 1.56 2 ▇▁▁▁▂
WHEEZING YES 0 1 1.60 0.49 1 1.00 2.00 2.00 2 ▅▁▁▁▇
ALCOHOL CONSUMING NO 0 1 1.22 0.34 1 1.00 1.00 1.42 2 ▇▁▁▁▁
ALCOHOL CONSUMING YES 0 1 1.60 0.49 1 1.00 2.00 2.00 2 ▅▁▁▁▇
COUGHING NO 0 1 1.37 0.40 1 1.00 1.18 1.77 2 ▇▂▁▂▃
COUGHING YES 0 1 1.62 0.49 1 1.00 2.00 2.00 2 ▅▁▁▁▇
SHORTNESS OF BREATH NO 0 1 1.56 0.41 1 1.14 1.58 2.00 2 ▅▂▂▁▇
SHORTNESS OF BREATH YES 0 1 1.68 0.47 1 1.00 2.00 2.00 2 ▃▁▁▁▇
SWALLOWING DIFFICULTY NO 0 1 1.10 0.26 1 1.00 1.00 1.00 2 ▇▁▁▁▁
SWALLOWING DIFFICULTY YES 0 1 1.57 0.50 1 1.00 2.00 2.00 2 ▆▁▁▁▇
CHEST PAIN NO 0 1 1.34 0.39 1 1.00 1.12 1.75 2 ▇▁▁▂▃
CHEST PAIN YES 0 1 1.62 0.49 1 1.00 2.00 2.00 2 ▅▁▁▁▇
GENDER_M NO 0 1 0.44 0.42 0 0.00 0.37 0.93 1 ▇▁▂▂▅
GENDER_M YES 0 1 0.54 0.50 0 0.00 1.00 1.00 1 ▇▁▁▁▇
rec_ov_sampled %>%
    prep() %>%
    juice() %>%
    ggplot(aes(x=LUNG_CANCER)) +
    geom_bar() +
    labs(title = "Lung Cancer diagnoses after SMOTE")

Let’s try the model out with this recipe instead…

rf_smote_flow <- workflow() %>%
    add_recipe(rec_ov_sampled) %>%
    add_model(rf_mod)

rf_fit <- fit(rf_smote_flow, train)
rf_fit %>%
    predict(val) %>%
    bind_cols(., actual = as.factor(val$LUNG_CANCER)) %>%
    conf_mat(truth = actual, estimate = .pred_class)
          Truth
Prediction NO YES
       NO   3   1
       YES  3  55

In this case, we’ve squeezed one better prediction out, and importantly it was indeed a false negative that we converted to a true positive.

I believe it’s probably time to try a different model — maybe RF isn’t best suited for this task.

SVM

Support vector machines are also good classifiers. Let’s try that out.
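For intuition, a linear SVM classifies with the sign of a linear decision function f(x) = w·x + b, where the cost parameter controls how heavily margin violations are penalized during fitting. A toy illustration of the decision rule (the weights here are made up, not fitted):

```r
# Hypothetical weights and intercept for two features
w <- c(0.8, -0.5)
b <- -0.2

svm_predict <- function(x) ifelse(drop(x %*% w) + b >= 0, "YES", "NO")

svm_predict(rbind(c(1.0, 0.2),   # f(x) = 0.8 - 0.1 - 0.2 =  0.5
                  c(0.1, 0.9)))  # f(x) = 0.08 - 0.45 - 0.2 = -0.57
# [1] "YES" "NO"
```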

svm_mod <- svm_linear(
  cost = double(1),
  margin = double(1)
) %>%  
  set_mode("classification")
svm_smote_flow <- workflow() %>%
    add_recipe(rec_ov_sampled) %>%
    add_model(svm_mod)

svm_fit <- fit(svm_smote_flow, train)
svm_fit %>% 
    predict(val) %>%
    bind_cols(., actual = as.factor(val$LUNG_CANCER)) %>%
    conf_mat(truth = actual, estimate = .pred_class)
          Truth
Prediction NO YES
       NO   6  26
       YES  0  30

Interestingly, this model produces far more false negatives, the worst kind of error in cancer screening! Let's try to tune it (note that in parsnip, `margin` is documented as the epsilon of the insensitive loss for SVM regression only, so tuning it here may have no effect):

tune_spec <- svm_linear(
  cost = tune(),
  margin = tune()
) %>%  
  set_mode("classification")

tune_svm_wf <- workflow() %>%
  add_recipe(rec_ov_sampled) %>%
  add_model(tune_spec)

tune_res <- tune_grid(
  tune_svm_wf,
  resamples = folds,
  grid = 50,
  metrics = classification_metrics
)
→ A | warning: While computing binary `precision()`, no predicted events were detected (i.e.
               `true_positive + false_positive = 0`).
               Precision is undefined in this case, and `NA` will be returned.
               Note that 4 true event(s) actually occurred for the problematic event level, NO
→ B | warning: (as A, with 1 true event)
→ C | warning: (as A, with 2 true events)
→ D | warning: (as A, with 3 true events)
There were issues with some computations   A: x2   B: x4   C: x1   D: x3
tune_res %>%
    collect_metrics() %>%
    filter(.metric == "f_meas") %>%
    arrange(-mean)
# A tibble: 50 × 8
       cost  margin .metric .estimator  mean     n std_err .config              
      <dbl>   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
 1  0.0595  0.0157  f_meas  binary     0.435     8  0.110  Preprocessor1_Model32
 2  0.932   0.112   f_meas  binary     0.408     7  0.102  Preprocessor1_Model19
 3 24.6     0.194   f_meas  binary     0.388    10  0.102  Preprocessor1_Model16
 4  0.641   0.0502  f_meas  binary     0.388     9  0.0989 Preprocessor1_Model14
 5  0.00602 0.145   f_meas  binary     0.374     9  0.0959 Preprocessor1_Model33
 6  0.104   0.152   f_meas  binary     0.362     7  0.0489 Preprocessor1_Model36
 7  6.29    0.0538  f_meas  binary     0.347    10  0.0891 Preprocessor1_Model03
 8  0.199   0.179   f_meas  binary     0.344     9  0.0969 Preprocessor1_Model18
 9  0.00434 0.00752 f_meas  binary     0.343     8  0.0981 Preprocessor1_Model17
10  2.40    0.00953 f_meas  binary     0.343     8  0.0977 Preprocessor1_Model37
# ℹ 40 more rows

It looks like there is an improvement in F score over the random forest, so I’ll stick with this model for now.

Let’s try these best parameters:

best <- tune_res %>% select_best(metric = "f_meas")
svm_best <- svm_linear(
  cost = best$cost,
  margin = best$margin
) %>%  
  set_mode("classification")

svm_smote_flow <- workflow() %>%
    add_recipe(rec_ov_sampled) %>%
    add_model(svm_best)

svm_fit <- fit(svm_smote_flow, train)
svm_fit %>% 
    predict(val) %>%
    bind_cols(., actual = as.factor(val$LUNG_CANCER)) %>%
    classification_metrics(truth = actual, estimate = .pred_class)
# A tibble: 2 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.919
2 f_meas   binary         0.444

Not much improvement!

Another Model: Decision Tree rpart

I’ve had some success with this algorithm in the past, let’s see if it works here:

rpart_mod <- decision_tree(
    mode = "classification"
)
rpart_flow <- workflow() %>%
    add_recipe(rec_ov_sampled) %>%
    add_model(rpart_mod)

rpart_fit <- fit(rpart_flow, train)
rpart_fit %>% 
    predict(val) %>%
    bind_cols(., actual = as.factor(val$LUNG_CANCER)) %>%
    conf_mat(truth = actual, estimate = .pred_class)
          Truth
Prediction NO YES
       NO   6  12
       YES  0  44

Not too much better, to be honest. For the sake of time, let’s settle on our best, the tuned SVM with a margin of 0.0157203 and cost of 0.0595231.

svm_fit %>%
    predict(test) %>%
    bind_cols(., actual = as.factor(test$LUNG_CANCER)) %>%
    conf_mat(truth = actual, estimate = .pred_class)
          Truth
Prediction NO YES
       NO   1   2
       YES  5  54
svm_fit %>%
    predict(test) %>%
    bind_cols(., actual = as.factor(test$LUNG_CANCER)) %>%
    classification_metrics(truth = actual, estimate = .pred_class)
# A tibble: 2 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.887
2 f_meas   binary         0.222
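To see why the F-score collapsed, we can decompose it by hand from the test-set confusion matrix above (again with NO as the event level):

```r
# Test-set confusion matrix, NO as the event class:
#            Truth
# Prediction NO YES
#        NO   1   2
#        YES  5  54
tp <- 1; fp <- 2; fn <- 5; tn <- 54

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)
round(c(precision = precision, recall = recall, f1 = f1), 3)
# precision 0.333, recall 0.167, f1 0.222 -- the low recall on NO drives the F-score down
```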

Conclusion

One of the most important aspects of disease prediction, especially for a disease like cancer, is that your models should ultimately be attempting to reduce the most harmful errors. In this case, false negatives would be devastating, as a missed cancer is far more costly than a false alarm. In this experiment, I found the F-measure to be a good metric, as it lets us measure not just accuracy but also, through its recall component, the likelihood of making such errors.

The final model underperformed in this respect, but I'm sure that, given more time, there is a model and parameter set that would reduce false negatives.