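Bootstrapping is implemented in cutpointr with two goals:
1. Optimizing cutpoints: the optimal cutpoint is determined in bootstrap samples (methods maximize_boot_metric / minimize_boot_metric).
2. Validating the cutpoint optimization: the chosen optimization method is repeated and evaluated in bootstrap samples (argument boot_runs).
This vignette briefly goes through examples for both approaches.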
maximize_metric
As a first basic example, the cutpoint optimization will be
demonstrated without any bootstrapping by maximizing the Youden-Index.
Using the method maximize_metric, this is performed on the full data set:
library(cutpointr)
data(suicide)
opt_cut <- cutpointr(
data = suicide,
x = dsi,
class = suicide,
method = maximize_metric,
metric = youden,
pos_class = "yes",
direction = ">="
)
summary(opt_cut)
## Method: maximize_metric
## Predictor: dsi
## Outcome: suicide
## Direction: >=
##
## AUC n n_pos n_neg
## 0.9238 532 36 496
##
## optimal_cutpoint youden acc sensitivity specificity tp fn fp tn
## 2 0.7518 0.8647 0.8889 0.8629 32 4 68 428
##
## Predictor summary:
## Data Min. 5% 1st Qu. Median Mean 3rd Qu. 95% Max. SD NAs
## Overall 0 0.00 0 0 0.9210526 1 5.00 11 1.852714 0
## no 0 0.00 0 0 0.6330645 0 4.00 10 1.412225 0
## yes 0 0.75 4 5 4.8888889 6 9.25 11 2.549821 0
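As a quick check, the metric columns in this summary can be recomputed by hand from the confusion-matrix counts tp, fn, fp and tn printed above. The following base R sketch does not call cutpointr; the variable names are chosen for illustration only.

```r
# Confusion-matrix counts copied from the summary output above
tp <- 32; fn <- 4; fp <- 68; tn <- 428

sensitivity  <- tp / (tp + fn)   # share of positives classified as positive
specificity  <- tn / (tn + fp)   # share of negatives classified as negative
youden_index <- sensitivity + specificity - 1
accuracy     <- (tp + tn) / (tp + fn + fp + tn)

round(c(youden = youden_index, acc = accuracy,
        sensitivity = sensitivity, specificity = specificity), 4)
```

The rounded results (0.7518, 0.8647, 0.8889 and 0.8629) agree with the youden, acc, sensitivity and specificity columns of the summary.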
The fields in the resulting R object opt_cut are to be interpreted as follows:
$optimal_cutpoint: The optimal cutpoint determined by maximizing the Youden-Index on the full suicide dataset.
$sensitivity: The sensitivity when applying the cutpoint to the full dataset.
$specificity: The specificity when applying the cutpoint to the full dataset.
$youden: The maximal Youden-Index (= sensitivity + specificity - 1), determined by the optimization.
maximize_boot_metric
The optimal cutpoint can also be determined using bootstrapping. To do so, choose the method maximize_boot_metric or minimize_boot_metric. These functions provide further arguments that configure the bootstrapping; they are documented in help("maximize_boot_metric", "cutpointr"). The most important arguments are:
boot_cut: The number of bootstrap repetitions.
boot_stratify: Whether the bootstrap samples are drawn in both classes separately before combining them, which keeps the number of positives and negatives constant in every sample.
summary_func: The summary function that aggregates the optimal cutpoints from the bootstrap samples into one final optimal cutpoint.
The cutpoint is optimized in n=boot_cut bootstrap samples by maximizing or minimizing the respective metric (the Youden-Index in this example) in each of these bootstrap samples. Finally, the summary function is applied to aggregate the optimal cutpoints from the n=boot_cut bootstrap samples into one final ‘optimal’ cutpoint.
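The procedure just described can be sketched in a few lines of base R. This is only an illustration under simplifying assumptions, not cutpointr's internal implementation: the data are simulated, the helper functions youden_at and best_cutpoint are hypothetical, and only observed predictor values are tried as candidate cutpoints.

```r
# Illustrative base R sketch; not cutpointr's internal implementation.
set.seed(1)

# Simulated data: higher scores are more common in the positive class
score <- c(rpois(400, lambda = 1), rpois(40, lambda = 5))
truth <- rep(c("no", "yes"), times = c(400, 40))

# Youden index when classifying score >= cutpoint as positive
youden_at <- function(cutpoint, score, truth) {
  pred_pos <- score >= cutpoint
  sens <- mean(pred_pos[truth == "yes"])
  spec <- mean(!pred_pos[truth == "no"])
  sens + spec - 1
}

# Cutpoint among the observed values that maximizes the Youden index
best_cutpoint <- function(score, truth) {
  candidates <- sort(unique(score))
  values <- vapply(candidates, youden_at, numeric(1),
                   score = score, truth = truth)
  candidates[which.max(values)]
}

n_boot_cut <- 200  # plays the role of boot_cut
boot_cutpoints <- replicate(n_boot_cut, {
  idx <- sample(length(score), replace = TRUE)  # one bootstrap sample
  best_cutpoint(score[idx], truth[idx])
})

# Aggregation step; plays the role of summary_func = mean
final_cutpoint <- mean(boot_cutpoints)
final_cutpoint
```

Because the per-sample cutpoints are averaged, the final value need not be an observed value of the predictor, even when the predictor takes only integer values as in this simulation.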
set.seed(123)
opt_cut <- cutpointr(
data = suicide,
x = dsi,
class = suicide,
method = maximize_boot_metric,
boot_cut = 200,
summary_func = mean,
metric = youden,
pos_class = "yes",
direction = ">="
)
summary(opt_cut)
## Method: maximize_boot_metric
## Predictor: dsi
## Outcome: suicide
## Direction: >=
##
## AUC n n_pos n_neg
## 0.9238 532 36 496
##
## optimal_cutpoint youden acc sensitivity specificity tp fn fp tn
## 2.055 0.6927 0.8816 0.8056 0.8871 29 7 56 440
##
## Predictor summary:
## Data Min. 5% 1st Qu. Median Mean 3rd Qu. 95% Max. SD NAs
## Overall 0 0.00 0 0 0.9210526 1 5.00 11 1.852714 0
## no 0 0.00 0 0 0.6330645 0 4.00 10 1.412225 0
## yes 0 0.75 4 5 4.8888889 6 9.25 11 2.549821 0
The fields in the resulting R object opt_cut are to be interpreted as follows:
$optimal_cutpoint: The optimal cutpoint, which is the aggregated value (as defined with summary_func) over all n=boot_cut bootstrap samples. Please note that no uncertainty measure (standard deviation, 95%-CI, etc.) is available here (a bootstrap distribution of these cutpoints can be generated using outer bootstrapping with boot_runs > 0 and maximize_metric, as explained below).
$sensitivity: The sensitivity when applying the optimal cutpoint to the full dataset.
$specificity: The specificity when applying the optimal cutpoint to the full dataset.
$youden: The Youden-Index when applying the optimal cutpoint to the full dataset.
Any method chosen to find the optimal cutpoint can subsequently be validated with bootstrapping. This is activated by setting the argument boot_runs > 0. Please be aware that the optimal cutpoint is first calculated with the specified method exactly as described above, resulting in the same outputs as above (given the same seed when bootstrapping cutpoints).
However, the method to calculate the optimal cutpoints will then additionally be performed on n=boot_runs bootstrap samples. For each of these bootstrap samples, several metrics and performance measures are available from the resulting $boot object, both for the in-bag (suffix: ‘_b’) and the out-of-bag (suffix: ‘_oob’) bootstrap samples. Please note that the optimal cutpoint is determined on the in-bag samples only and then just applied to the out-of-bag samples for validation purposes, so its value is available only once in the $boot object without a suffix.
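To make the in-bag / out-of-bag distinction concrete, here is a base R sketch of a single validation repetition. It is an illustration on simulated data, not cutpointr's internal implementation; the helper youden_at and all variable names are hypothetical.

```r
# Illustrative base R sketch of one validation bootstrap repetition;
# not cutpointr's internal implementation.
set.seed(2)

score <- c(rpois(400, lambda = 1), rpois(40, lambda = 5))
truth <- rep(c("no", "yes"), times = c(400, 40))

# Youden index when classifying score >= cutpoint as positive
youden_at <- function(cutpoint, score, truth) {
  pred_pos <- score >= cutpoint
  mean(pred_pos[truth == "yes"]) + mean(!pred_pos[truth == "no"]) - 1
}

n <- length(score)
in_bag     <- sample(n, replace = TRUE)     # indices drawn with replacement
out_of_bag <- setdiff(seq_len(n), in_bag)   # observations never drawn

# The cutpoint is optimized on the in-bag sample only ...
candidates    <- sort(unique(score[in_bag]))
in_bag_values <- vapply(candidates, youden_at, numeric(1),
                        score = score[in_bag], truth = truth[in_bag])
cutpoint <- candidates[which.max(in_bag_values)]

# ... and then merely applied to both samples to compute the metrics
c(optimal_cutpoint = cutpoint,
  youden_b   = youden_at(cutpoint, score[in_bag], truth[in_bag]),
  youden_oob = youden_at(cutpoint, score[out_of_bag], truth[out_of_bag]))
```

Repeating this n=boot_runs times would yield one row per repetition, with a single cutpoint and a pair of in-bag and out-of-bag values for every metric.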
maximize_metric
opt_cut <- cutpointr(
data = suicide,
x = dsi,
class = suicide,
method = maximize_metric,
metric = youden,
pos_class = "yes",
direction = ">=",
boot_runs = 100
)
## Running bootstrap...
The interpretation of fields in the resulting R object opt_cut is the same as above. The results from the bootstrapping are available from $boot.
## Method: maximize_metric
## Predictor: dsi
## Outcome: suicide
## Direction: >=
## Nr. of bootstraps: 100
##
## AUC n n_pos n_neg
## 0.9238 532 36 496
##
## optimal_cutpoint youden acc sensitivity specificity tp fn fp tn
## 2 0.7518 0.8647 0.8889 0.8629 32 4 68 428
##
## Predictor summary:
## Data Min. 5% 1st Qu. Median Mean 3rd Qu. 95% Max. SD NAs
## Overall 0 0.00 0 0 0.9210526 1 5.00 11 1.852714 0
## no 0 0.00 0 0 0.6330645 0 4.00 10 1.412225 0
## yes 0 0.75 4 5 4.8888889 6 9.25 11 2.549821 0
##
## Bootstrap summary:
## Variable Min. 5% 1st Qu. Median Mean 3rd Qu. 95% Max. SD NAs
## optimal_cutpoint 1.00 1.00 2.00 2.00 2.08 2.00 4.00 4.00 0.69 0
## AUC_b 0.85 0.89 0.90 0.92 0.92 0.94 0.96 0.97 0.02 0
## AUC_oob 0.82 0.86 0.91 0.93 0.92 0.95 0.97 0.98 0.04 0
## youden_b 0.60 0.67 0.72 0.75 0.75 0.79 0.85 0.89 0.06 0
## youden_oob 0.49 0.58 0.67 0.73 0.72 0.78 0.84 0.87 0.08 0
## acc_b 0.74 0.77 0.86 0.87 0.86 0.88 0.91 0.92 0.04 0
## acc_oob 0.74 0.77 0.84 0.86 0.86 0.88 0.90 0.92 0.04 0
## sensitivity_b 0.76 0.82 0.86 0.89 0.90 0.93 0.97 1.00 0.05 0
## sensitivity_oob 0.60 0.69 0.81 0.87 0.86 0.92 1.00 1.00 0.09 0
## specificity_b 0.72 0.76 0.85 0.87 0.86 0.88 0.91 0.92 0.04 0
## specificity_oob 0.73 0.76 0.84 0.86 0.86 0.88 0.91 0.93 0.04 0
## cohens_kappa_b 0.19 0.25 0.38 0.43 0.41 0.46 0.52 0.56 0.07 0
## cohens_kappa_oob 0.15 0.25 0.34 0.39 0.39 0.44 0.49 0.56 0.08 0
## # A tibble: 6 × 23
## optimal_cutpoint AUC_b AUC_oob youden_b youden_oob acc_b acc_oob sensitivity_b
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 0.891 0.95 0.732 0.698 0.852 0.863 0.882
## 2 1 0.912 0.969 0.705 0.753 0.774 0.769 0.943
## 3 2 0.902 0.934 0.718 0.780 0.842 0.918 0.879
## 4 2 0.892 0.961 0.662 0.808 0.842 0.880 0.818
## 5 2 0.893 0.966 0.701 0.818 0.850 0.909 0.851
## 6 2 0.941 0.909 0.788 0.755 0.891 0.843 0.897
## # ℹ 15 more variables: sensitivity_oob <dbl>, specificity_b <dbl>,
## # specificity_oob <dbl>, cohens_kappa_b <dbl>, cohens_kappa_oob <dbl>,
## # TP_b <dbl>, FP_b <dbl>, TN_b <int>, FN_b <int>, TP_oob <dbl>, FP_oob <dbl>,
## # TN_oob <int>, FN_oob <int>, roc_curve_b <list>, roc_curve_oob <list>
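One possible use of these per-sample results is to summarize the bootstrap distribution of the optimal cutpoint and the out-of-bag performance. The sketch below assumes that the boot column holds a single tibble that can be reached with opt_cut$boot[[1]], as the printed output suggests; the object name boot_res, the seed and the 95% level are arbitrary choices made for illustration.

```r
library(cutpointr)
data(suicide)

set.seed(100)  # arbitrary seed, only for reproducibility of this sketch
opt_cut <- cutpointr(
  data = suicide, x = dsi, class = suicide,
  method = maximize_metric, metric = youden,
  pos_class = "yes", direction = ">=",
  boot_runs = 100
)

# One row per bootstrap repetition
boot_res <- opt_cut$boot[[1]]

# Percentile interval of the optimal cutpoints across the repetitions
quantile(boot_res$optimal_cutpoint, probs = c(0.025, 0.975))

# Mean out-of-bag Youden-Index across the repetitions
mean(boot_res$youden_oob, na.rm = TRUE)
```

The out-of-bag values are the more conservative ones to report, since the cutpoint was not optimized on those observations.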
maximize_boot_metric
When bootstrapping cutpoints and also using the validation with bootstrapping, the optimal cutpoint will again first be determined as above: the respective metric is maximized or minimized in each of the n=boot_cut bootstrap samples, and the summary function then aggregates the resulting optimal cutpoints into one final ‘optimal’ cutpoint. Hence, using the same seeds here results in the same outputs as above, where no outer bootstrapping is applied.
In the validation routine, the chosen cutpoint optimization is then repeated in each of the n=boot_runs (outer) bootstrap samples: the optimal cutpoint is determined in each bootstrap sample by optimizing the metric on n=boot_cut (inner) bootstrap samples and applying the summary_func to aggregate them into one value.
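A rough count shows how quickly this nesting adds up. The sketch below uses boot_cut = 200 and boot_runs = 100 as example values and assumes that every inner bootstrap sample costs exactly one cutpoint optimization; the variable n_optimizations is introduced here for illustration only.

```r
boot_cut  <- 200  # inner bootstrap samples per cutpoint
boot_runs <- 100  # outer validation bootstrap samples

# One inner bootstrap on the full data plus one per outer sample
n_optimizations <- boot_cut * (1 + boot_runs)
n_optimizations
```

With these values, 20200 single cutpoint optimizations are needed, compared with 200 when no validation is requested.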
Since the (inner) bootstrapping of optimal cutpoints is performed in each of the (outer) validation bootstrap samples, this can be computationally very expensive and take some time to finish. Therefore, parallelization is implemented in cutpointr; it is enabled by setting the argument allowParallel = TRUE and initializing a parallel environment.
library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
library(doRNG)
## Loading required package: rngtools
cl <- makeCluster(2) # 2 cores
registerDoParallel(cl)
registerDoRNG(12) # reproducible random numbers in the parallel workers
set.seed(123)
opt_cut <- cutpointr(
data = suicide,
x = dsi,
class = suicide,
method = maximize_boot_metric,
boot_cut = 200,
summary_func = mean,
metric = youden,
pos_class = "yes",
direction = ">=",
boot_runs = 100,
allowParallel = TRUE
)
## Running bootstrap...
Again, the interpretation of fields in the resulting R object opt_cut is the same as above. The results from the bootstrapping are available from $boot.
## Method: maximize_boot_metric
## Predictor: dsi
## Outcome: suicide
## Direction: >=
## Nr. of bootstraps: 100
##
## AUC n n_pos n_neg
## 0.9238 532 36 496
##
## optimal_cutpoint youden acc sensitivity specificity tp fn fp tn
## 2.055 0.6927 0.8816 0.8056 0.8871 29 7 56 440
##
## Predictor summary:
## Data Min. 5% 1st Qu. Median Mean 3rd Qu. 95% Max. SD NAs
## Overall 0 0.00 0 0 0.9210526 1 5.00 11 1.852714 0
## no 0 0.00 0 0 0.6330645 0 4.00 10 1.412225 0
## yes 0 0.75 4 5 4.8888889 6 9.25 11 2.549821 0
##
## Bootstrap summary:
## Variable Min. 5% 1st Qu. Median Mean 3rd Qu. 95% Max. SD NAs
## optimal_cutpoint 1.07 1.60 1.93 2.08 2.16 2.28 2.97 3.60 0.45 0
## AUC_b 0.86 0.89 0.91 0.93 0.93 0.94 0.96 0.96 0.02 0
## AUC_oob 0.84 0.88 0.90 0.92 0.92 0.95 0.97 0.98 0.03 0
## youden_b 0.60 0.63 0.68 0.72 0.72 0.76 0.79 0.84 0.05 0
## youden_oob 0.48 0.57 0.64 0.69 0.71 0.78 0.84 0.88 0.09 0
## acc_b 0.83 0.85 0.87 0.88 0.88 0.89 0.91 0.93 0.02 0
## acc_oob 0.83 0.84 0.86 0.88 0.88 0.89 0.91 0.92 0.02 0
## sensitivity_b 0.71 0.75 0.80 0.83 0.84 0.87 0.91 0.94 0.05 0
## sensitivity_oob 0.58 0.69 0.75 0.81 0.83 0.91 1.00 1.00 0.10 0
## specificity_b 0.83 0.85 0.87 0.88 0.88 0.89 0.91 0.94 0.02 0
## specificity_oob 0.82 0.83 0.87 0.88 0.88 0.90 0.92 0.93 0.03 0
## cohens_kappa_b 0.32 0.33 0.38 0.42 0.43 0.47 0.54 0.59 0.06 0
## cohens_kappa_oob 0.22 0.31 0.37 0.41 0.41 0.47 0.53 0.56 0.07 0
## # A tibble: 6 × 23
## optimal_cutpoint AUC_b AUC_oob youden_b youden_oob acc_b acc_oob sensitivity_b
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2.34 0.931 0.950 0.729 0.613 0.870 0.892 0.857
## 2 2.58 0.948 0.896 0.725 0.628 0.883 0.867 0.839
## 3 1.70 0.939 0.889 0.784 0.711 0.872 0.846 0.914
## 4 1.95 0.894 0.962 0.680 0.844 0.861 0.851 0.816
## 5 2.03 0.906 0.963 0.692 0.786 0.885 0.872 0.8
## 6 2.58 0.928 0.881 0.771 0.540 0.900 0.860 0.868
## # ℹ 15 more variables: sensitivity_oob <dbl>, specificity_b <dbl>,
## # specificity_oob <dbl>, cohens_kappa_b <dbl>, cohens_kappa_oob <dbl>,
## # TP_b <dbl>, FP_b <dbl>, TN_b <int>, FN_b <int>, TP_oob <dbl>, FP_oob <dbl>,
## # TN_oob <int>, FN_oob <int>, roc_curve_b <list>, roc_curve_oob <list>
Some visualizations of the bootstrapping results are available with the plot function, e.g. plot(opt_cut).
The two plots in the lower half can be generated separately with plot_cut_boot(opt_cut) and plot_metric_boot(opt_cut).