To show how `tidyposterior`

compares models, let’s look at
a small data set. The `modeldata`

package has a data set
called `two_class_dat`

that has 791 data points on to
predictors. The outcome is a two-level factor. There is some linear-ish
separation between the classes but hints that a nonlinear class boundary
might do slightly better.

```
library(tidymodels)
library(tidyposterior)
data(two_class_dat)
ggplot(two_class_dat, aes(x = A, y = B, col = Class)) +
geom_point(alpha = 0.3, cex = 2) +
coord_fixed()
```

`tidyposterior`

models performance statistics produced by
models, such as RMSE, accuracy, or the area under the ROC curve. It
relies on resampling to produce replicates of these performance
statistics so that they can be modeled.

We’ll use simple 10-fold cross-validation here. Any other resampling
method from `rsample`

, except a simple validation set, would
also be appropriate.

```
set.seed(1)
cv_folds <- vfold_cv(two_class_dat)
cv_folds
```

```
## # 10-fold cross-validation
## # A tibble: 10 × 2
## splits id
## <list> <chr>
## 1 <split [711/80]> Fold01
## 2 <split [712/79]> Fold02
## 3 <split [712/79]> Fold03
## 4 <split [712/79]> Fold04
## 5 <split [712/79]> Fold05
## 6 <split [712/79]> Fold06
## 7 <split [712/79]> Fold07
## 8 <split [712/79]> Fold08
## 9 <split [712/79]> Fold09
## 10 <split [712/79]> Fold10
```

We’ll use a logistic regression model for these data and initially consider two different preprocessing methods that might help the fit. Let’s define a model specification:

`logistic_spec <- logistic_reg() %>% set_engine("glm")`

## Comparing modeling/prepreocessing methods

One way to incorporate nonlinearity into the class boundary is to use
a spline basis expansion for the predictors. A `recipe`

step
using `step_ns`

will will encode the predictors in this way.
The degrees of freedom will be hard-coded to produce two additional
feature columns per predictor:

```
spline_rec <-
recipe(Class ~ ., data = two_class_dat) %>%
step_ns(A, B, deg_free = 3)
spline_wflow <-
workflow() %>%
add_recipe(spline_rec) %>%
add_model(logistic_spec)
```

Binding the model and recipe into a workflow creates a simple
interface when we `fit()`

and `predict()`

the data
(but isn’t required by `tidypredict`

).

An alternate preprocessing method is to normalize the data using a spatial sign transformation. This projects the predictors on to a unit circle and can sometimes mitigate the effect of collinearity or outliers. A recipe is also used. Here is a visual representation of the data after the transformation:

```
spatial_sign_rec <-
recipe(Class ~ ., data = two_class_dat) %>%
step_normalize(A, B) %>%
step_spatialsign(A, B)
spatial_sign_rec %>%
prep() %>%
bake(new_data = NULL) %>%
ggplot(aes(x = A, y = B, col = Class)) +
geom_point(alpha = 0.3, cex = 2) +
coord_fixed()
```

Another workflow is created for this method:

`tidyposterior`

does not require the user to create their
models using tidymodels packages, `caret`

, or any other
method (although there are advantages to using those tools). In the end
a data frame format with resample identifiers and columns for
performance statistics are needed.

To produce this format with our tidymodels objects, this small convenience function will create a model on the 90% of the data allocated by cross-validation, predict the other 10%, then calculate the area under the ROC curve. If you use tidymodels, there are high-level interfaces (shown below) that don’t require such a function.

```
compute_roc <- function(split, wflow) {
# Fit the model to 90% of the data
mod <- fit(wflow, data = analysis(split))
# Predict the other 10%
pred <- predict(mod, new_data = assessment(split), type = "prob")
# Compute the area under the ROC curve
pred %>%
bind_cols(assessment(split)) %>%
roc_auc(Class, .pred_Class1) %>%
pull(.estimate)
}
```

For our `rsample`

object `cv_folds`

, let’s
create two columns of ROC statistics using this function in conjunction
with `purrr::map`

and `dplyr::mutate`

:

```
roc_values <-
cv_folds %>%
mutate(
spatial_sign = map_dbl(splits, compute_roc, spatial_sign_wflow),
splines = map_dbl(splits, compute_roc, spline_wflow)
)
roc_values
```

```
## # 10-fold cross-validation
## # A tibble: 10 × 4
## splits id spatial_sign splines
## <list> <chr> <dbl> <dbl>
## 1 <split [711/80]> Fold01 0.943 0.946
## 2 <split [712/79]> Fold02 0.861 0.852
## 3 <split [712/79]> Fold03 0.819 0.838
## 4 <split [712/79]> Fold04 0.844 0.873
## 5 <split [712/79]> Fold05 0.865 0.902
## 6 <split [712/79]> Fold06 0.868 0.909
## 7 <split [712/79]> Fold07 0.952 0.934
## 8 <split [712/79]> Fold08 0.904 0.920
## 9 <split [712/79]> Fold09 0.765 0.811
## 10 <split [712/79]> Fold10 0.851 0.889
```

```
# Overall ROC statistics per workflow:
summarize(
roc_values,
splines = mean(splines),
spatial_sign = mean(spatial_sign)
)
```

```
## # A tibble: 1 × 2
## splines spatial_sign
## <dbl> <dbl>
## 1 0.887 0.867
```

There is the suggestion that using splines is better than the spatial sign. It would be nice to have some inferential analysis that could tell us if the size of this difference is create than the experimental noise in the data.

`tidyposterior`

uses a Bayesian ANOVA model to compute
posterior distributions for the performance statistic of each modeling
method. This tells use the probabilistic distribution of the model
performance metrics and allows us to make more formal statements about
the superiority (or equivalence) of different models. *Tidy Models
with R* has a good explanation of how the Bayesian ANOVA model
works.

The main function to conduct the analysis is `perf_mod()`

.
The main argument is for the object containing the resampling
information and at least two numeric columns of performance statistics
(measuring the same metric). As described in `?perf_mod`

,
there are a variety of other object types that can be used for this
argument.

There are also options for statistical parameters of the analysis, such as any transformation of the output statistics that should be used and so on.

The main options in our analysis are passed through to the
`rstanarm`

function `stan_glmer()`

. These
include:

`seed`

: An integer that controls the random numbers used in the Bayesian model.`iter`

: The total number of Montre Carlo iterations used (including the burn-in samples).`chains`

: The number of independent Markov Chain Monte Carlo analyses to compute.`refresh`

: How often to update the log (a value of zero means no output).

Other options that can be helpful (but we’ll use their defaults):

`prior_intercept`

: The main argument in this analysis for specifying the prior distribution of the parameters.`family`

: The exponential family distribution for the performance statistics.`cores`

: The number of parallel workers to use to speed-up computations.

Our call to this function is:

`rset_mod <- perf_mod(roc_values, seed = 2, iter = 5000, chains = 5, refresh = 0)`

The `summary()`

function for this type of object shows the
output from `stan_glmer()`

. It’s long, so we show some of the
initial output:

```
##
## Model Info:
## function: stan_glmer
## family: gaussian [identity]
## formula: statistic ~ model + (1 | id)
## algorithm: sampling
## sample: 12500 (posterior sample size)
## priors: see help('prior_summary')
## observations: 20
## groups: id (10)
##
## Estimates:
## mean sd 10% 50% 90%
## (Intercept) 0.867 0.016 0.847 0.867 0.887
## modelsplines 0.020 0.010 0.009 0.020 0.032
## b[(Intercept) id:Fold01] 0.060 0.021 0.033 0.060 0.086
##
## <snip>
```

Assuming that our assumptions are appropriate, one of the main things
that we’d like to get out of the object are samples of the posterior
distributions for the performance metrics (per modeling method). The
`tidy()`

method will produce a data frame with such
samples:

`tidy(rset_mod, seed = 3)`

```
## # Posterior samples of performance
## # A tibble: 25,000 × 2
## model posterior
## <chr> <dbl>
## 1 spatial_sign 0.875
## 2 splines 0.898
## 3 spatial_sign 0.867
## 4 splines 0.892
## 5 spatial_sign 0.853
## 6 splines 0.869
## 7 spatial_sign 0.841
## 8 splines 0.872
## 9 spatial_sign 0.844
## 10 splines 0.873
## # ℹ 24,990 more rows
```

We require a `seed`

value since it is a sample.

There is a simple plotting method for the object too:

`autoplot(rset_mod)`

There is some overlap but, again, it would be better if we could quantify this.

To compare models, the `contrast_models()`

function
computes the posterior distributions of differences in performance
statistics between models. For example, what does the posterior look
like for the difference in performance for these two preprocessing
methods? By default, the function computes all possible differences (a
single contrast for this example). There are also `summary()`

and plot methods:

```
preproc_diff <- contrast_models(rset_mod, seed = 4)
summary(preproc_diff, seed = 5)
```

```
## # A tibble: 1 × 9
## contrast probability mean lower upper size pract_neg
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 spatial_sign vs sp… 0.0194 -0.0201 -0.0356 -0.00473 0 NA
## # ℹ 2 more variables: pract_equiv <dbl>, pract_pos <dbl>
```

Since the difference is negative, the spline model appears better than the spatial sign method. The summary output quantifies this by producing a simple credible interval for the difference. The probability column also reflects this since it is the probability that the spline ROC scores are greater than the analogous statistics from the spatial sign model. A value of 0.5 would indicate no difference.

There is an additional analysis that can be used. The ROPE method,
short for *Region of Practical Equivalence*, is a method for
understanding the differences in models in less subjective way. For this
analysis, we would specify a practical effect size (usually before the
analysis). This quantity reflects what difference in the metric is
considered practically meaning full in the context of our problem. In
our example, if we saw two models with a difference in their ROC
statistics of 0.02, we might consider them effectually different (your
beliefs may differ).

Once we have settled on a value of this effect size (in the units of
the statistic), we can compute how much of the difference is within this
region of practical equivalence (in our example, this is
`[-0.02, 0.02]`

). If the difference is mostly within these
bounds, the models might be significantly different but not practically
different. Alternatively, if the differences are beyond this, they would
be different in both senses.

The summary and plot methods have optional arguments called
`size.`

The `summary()`

function computes the
probability of the posterior differences that fall inside and outside of
this region. The plot method shows it visually:

```
summary(preproc_diff, size = 0.02) %>%
select(contrast, starts_with("pract"))
```

```
## # A tibble: 1 × 4
## contrast pract_neg pract_equiv pract_pos
## <chr> <dbl> <dbl> <dbl>
## 1 spatial_sign vs splines 0.502 0.497 0.00072
```

`autoplot(preproc_diff, size = 0.02)`

For this analysis, there are about even odds that the difference
between these models is not practically important (since the
`pract_equiv`

is near 0.5).

### About our assumptions

Previously, the expression “assuming that our assumptions are appropriate” was used. This is an inferential analysis and the validity of our assumptions matter a great deal. There are a few assumptions for this analysis. The main one is that we’ve specified the outcome distribution well. We’ve models the area under the ROC curve. This is a statistic bounded (effectively) between 0.5 and 1.0. The variance of the statistic is probably related to the mean; there is likely less variation in scores near 1.0 than those near 0.5.

The default family for `stan_glmer()`

is Gaussian. Given
the characteristics of this metric, that assumption might seem
problematic.

However, Gaussian seems like a good first approach for this assumption. The rationale is based on the Central Limit Theorem. As the sample size increases, the sample mean statistic converges to normality despite the distribution of the individual data points. Our performance estimates are summary statistics and, if the training set size is “large enough”, they will exhibit behavior consistent with normality.

As a simple (and approximate) example/diagnostics, suppose we used a
simple ANOVA for the ROC statistics using `lm()`

. This is not
the same analysis as the one used by `tidyposterior`

, but the
regression parameter estimates should be fairly similar. For that
analysis, we can assess the normality of the residuals and see that they
are pretty consistent with the normality assumption:

```
roc_longer <-
roc_values %>%
select(-splits) %>%
pivot_longer(cols = c(-id), names_to = "preprocessor", values_to = "roc")
roc_fit <- lm(roc ~ preprocessor, roc_longer)
roc_fit %>%
augment() %>%
ggplot(aes(sample = .resid)) +
geom_qq() +
geom_qq_line(lty = 2) +
coord_fixed(ratio = 20)
```

If this were not the case there are a few things that we can do.

The easiest approach would be to use a variance stabilizing
transformation of the metrics and keep the Gaussian assumption.
`perf_mod()`

has a `transform`

argument that will
transform the outcome but still produce the posterior distributions in
the original units. This will help if the variation within each model
significantly changes over the range of the values. When transformed
back to the original units, the posteriors will have different
variances.

Another option that can help with heterogeneous variances is
`hetero_var.`

This fits a difference variance for each
modeling method. However, this may make convergence of the model more
difficult.

Finally, a different distribution can be assumed using the family
argument to `stan_glmer()`

. Since our metrics are numeric,
there are not many families to choose from.

## Evaluating sub-models

The previous example was a between-model comparison (where “model” really means statistical model plus preprocessing method). If the model must be tuned, there is also the issue of within-model comparisons.

For our spline analysis, we assumed that three degrees of freedom were appropriate. However, we might tune the model over that parameter to see what the best degrees of freedom should be.

The previous spline recipe is altered so that the degrees of freedom
parameter doesn’t have an actual value. Instead, using a value of
`tune()`

will mark this parameter for optimization. There are
a few different approaches for tuning this parameter; we’ll use simpe
grid search.

```
spline_rec <-
recipe(Class ~ ., data = two_class_dat) %>%
step_ns(A, B, deg_free = tune())
```

The `tune`

package function `tune_grid()`

is
used to evaluate several values of the parameter. For each value, the
resampled area under the ROC curve is computed.

```
spline_tune <-
logistic_spec %>%
tune_grid(
spline_rec,
resamples = cv_folds,
grid = tibble(deg_free = c(1, 3, 5, 10)),
metrics = metric_set(roc_auc),
control = control_grid(save_workflow = TRUE)
)
collect_metrics(spline_tune) %>%
arrange(desc(mean))
```

```
## # A tibble: 4 × 7
## deg_free .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 1 roc_auc binary 0.888 10 0.0141 Preprocessor1_Model1
## 2 3 roc_auc binary 0.887 10 0.0137 Preprocessor2_Model1
## 3 5 roc_auc binary 0.886 10 0.0133 Preprocessor3_Model1
## 4 10 roc_auc binary 0.883 10 0.0128 Preprocessor4_Model1
```

There is a `perf_mod()`

method for this type of object.
The computations are conducted in the same manner but, in this instance,
four sub-models are compared.

```
grid_mod <- perf_mod(spline_tune, seed = 6, iter = 5000, chains = 5, refresh = 0)
autoplot(grid_mod)
```

When the object given to perf_mod is from a model tuning function,
the `model`

column corresponds to the `.config`

column in the results.

There is a lot of overlap. The results do call into question the overall utility of using splines. A single degree of freedom model corresponds to a linear effect. Let’s compare the linear class boundaries to the other sub-models to see if splines are even improving the model.

The contrast_model function can take two lists of model identifiers
and compute their differences. Again, for tuning objects, this should
include values of `.config`

. This specification compute the
difference `{1 df - X df}`

so positive differences indicate
that the linear model is better.

```
grid_diff <-
contrast_models(
grid_mod,
list_1 = rep("Preprocessor1_Model1", 3),
list_2 = c(
"Preprocessor2_Model1", # <- 3 df spline
"Preprocessor3_Model1", # <- 5 df spline
"Preprocessor4_Model1" # <- 10 df spline
),
seed = 7
)
```

`summary(grid_diff)`

```
## # A tibble: 3 × 9
## contrast probability mean lower upper size pract_neg
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Preprocessor1_Mode… 0.535 2.59e-4 -0.00573 0.00609 0 NA
## 2 Preprocessor1_Mode… 0.636 1.19e-3 -0.00475 0.00708 0 NA
## 3 Preprocessor1_Mode… 0.897 4.52e-3 -0.00153 0.0105 0 NA
## # ℹ 2 more variables: pract_equiv <dbl>, pract_pos <dbl>
```

`autoplot(grid_diff)`

The results indicate that a lot of degrees of freedom might make the model worse. At best, there is a limited difference in performance when more than one spline term is used.

The ROPE analysis is more definitive; there is no sense of practical differences within the previously used effect size:

`autoplot(grid_diff, size = 0.02)`

## Workflow sets

Workflow sets are collections of workflows and their results. These
can be made after existing workflows have been evaluated or by using
`workflow_set()`

to create an evaluate the models.

Let’s create an initial set that has difference combinations of the two predictors for this data set.

```
library(workflowsets)
logistic_set <-
workflow_set(
list(A = Class ~ A, B = Class ~ B, ratio = Class ~ I(log(A/B)),
spatial_sign = spatial_sign_rec),
list(logistic = logistic_spec)
)
logistic_set
```

```
## # A workflow set/tibble: 4 × 4
## wflow_id info option result
## <chr> <list> <list> <list>
## 1 A_logistic <tibble [1 × 4]> <opts[0]> <list [0]>
## 2 B_logistic <tibble [1 × 4]> <opts[0]> <list [0]>
## 3 ratio_logistic <tibble [1 × 4]> <opts[0]> <list [0]>
## 4 spatial_sign_logistic <tibble [1 × 4]> <opts[0]> <list [0]>
```

The `object`

volumn contains the workflows that are
created by the combinations of preprocessors and the model (multiple
models could have been used). Rather than calling the same functions
from the `tune`

package repeatedly, we can evaluate these
with a single function call. Notice that none of these workflows require
tuning so `tune::fit_resamples()`

can be used:

```
logistic_res <-
logistic_set %>%
workflow_map("fit_resamples", seed = 3, resamples = cv_folds,
metrics = metric_set(roc_auc))
logistic_res
```

```
## # A workflow set/tibble: 4 × 4
## wflow_id info option result
## <chr> <list> <list> <list>
## 1 A_logistic <tibble [1 × 4]> <opts[2]> <rsmp[+]>
## 2 B_logistic <tibble [1 × 4]> <opts[2]> <rsmp[+]>
## 3 ratio_logistic <tibble [1 × 4]> <opts[2]> <rsmp[+]>
## 4 spatial_sign_logistic <tibble [1 × 4]> <opts[2]> <rsmp[+]>
```

```
collect_metrics(logistic_res) %>%
filter(.metric == "roc_auc")
```

```
## # A tibble: 4 × 9
## wflow_id .config preproc model .metric .estimator mean n std_err
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <int> <dbl>
## 1 A_logistic Prepro… formula logi… roc_auc binary 0.702 10 0.0210
## 2 B_logistic Prepro… formula logi… roc_auc binary 0.866 10 0.0151
## 3 ratio_logi… Prepro… formula logi… roc_auc binary 0.749 10 0.0164
## 4 spatial_si… Prepro… recipe logi… roc_auc binary 0.867 10 0.0176
```

We can also add the previously tuned spline results by first converting them to a workflow set then appending their rows to the results:

```
logistic_res <-
logistic_res %>%
bind_rows(
as_workflow_set(splines = spline_tune)
)
logistic_res
```

```
## # A workflow set/tibble: 5 × 4
## wflow_id info option result
## <chr> <list> <list> <list>
## 1 A_logistic <tibble [1 × 4]> <opts[2]> <rsmp[+]>
## 2 B_logistic <tibble [1 × 4]> <opts[2]> <rsmp[+]>
## 3 ratio_logistic <tibble [1 × 4]> <opts[2]> <rsmp[+]>
## 4 spatial_sign_logistic <tibble [1 × 4]> <opts[2]> <rsmp[+]>
## 5 splines <tibble [1 × 4]> <opts[0]> <tune[+]>
```

There are some convenience functions to take an initial look at the results:

```
rank_results(logistic_res, rank_metric = "roc_auc") %>%
filter(.metric == "roc_auc")
```

```
## # A tibble: 8 × 9
## wflow_id .config .metric mean std_err n preprocessor model rank
## <chr> <chr> <chr> <dbl> <dbl> <int> <chr> <chr> <int>
## 1 splines Prepro… roc_auc 0.888 0.0141 10 recipe logi… 1
## 2 splines Prepro… roc_auc 0.887 0.0137 10 recipe logi… 2
## 3 splines Prepro… roc_auc 0.886 0.0133 10 recipe logi… 3
## 4 splines Prepro… roc_auc 0.883 0.0128 10 recipe logi… 4
## 5 spatial_si… Prepro… roc_auc 0.867 0.0176 10 recipe logi… 5
## 6 B_logistic Prepro… roc_auc 0.866 0.0151 10 formula logi… 6
## 7 ratio_logi… Prepro… roc_auc 0.749 0.0164 10 formula logi… 7
## 8 A_logistic Prepro… roc_auc 0.702 0.0210 10 formula logi… 8
```

`autoplot(logistic_res, metric = "roc_auc")`

The `perf_mod()`

method for workflow sets takes the best
submodel from each workflow and then uses the standard
`tidyposterior`

analysis:

`roc_mod <- perf_mod(logistic_res, metric = "roc_auc", seed = 1, refresh = 0)`

The results of this call produces an object with an additional class
to enable some `autoplot()`

methods specific to workflow
sets. For example, the default plot shows 90% credible intervals for the
best results in each workflow:

`autoplot(roc_mod)`

Alternatively, the ROPE estimates for a given since can be computed to compare the numerically best workflow to the others. The probability of practical equivalence is shown for all results:

`autoplot(roc_mod, type = "ROPE", size = 0.025)`