Description
With upcoming hierarchical models, GAMs, and others, we need to make the workflow interface smoother.
Currently, it is not intuitive in a few ways:
- There are two formulas: the usual formula and the model formula (the one supplied via add_model()).
- The default formula processing can eliminate the specific variables that you want to keep intact.
Historically, the model formula has always done many things: specify the variables in the model, create encodings for them, and then hand them off to the model with the appropriate analysis roles (e.g. outcome, predictor, etc).
Example
For example, if there were a parsnip hierarchical model to fit via stan or lme4, a user's first attempt would be:
library(tidymodels)
data(sleepstudy, package = "lme4")
mod <- linear_reg() %>% set_engine("stan glmer")
wflow_0 <-
  workflow() %>%
  # Won't work since the basic formula method makes dummy variables
  add_formula(Reaction ~ Days + (Days || Subject)) %>%
  add_model(mod)
Calling fit() on this workflow will generate the error:
Error in Days || Subject : invalid 'y' type in 'x || y'
(an error message that could itself be more informative)
Looking around, the user finds the formula argument to add_model():
wflow_1 <-
  workflow() %>%
  # Make a simple formula for processing the data
  add_formula(Reaction ~ Days + Subject) %>%
  # Then add another formula to give to the model:
  add_model(mod, formula = Reaction ~ Days + (Days || Subject))
That fails with the error
Error in eval(predvars, data, env) : object 'Subject' not found
because add_formula() makes dummy variables, so the original Subject column no longer exists by the time the model formula is evaluated.
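To see why Subject goes missing, here is a minimal base R sketch of standard formula processing. It uses model.matrix() as a stand-in for what add_formula() does by default (an analogy, not the actual workflows internals), and the toy data frame is made up for illustration rather than taken from sleepstudy:

```r
# Hypothetical toy data standing in for sleepstudy (values are made up)
toy <- data.frame(
  Reaction = c(250, 260, 300, 310, 280, 290),
  Days     = c(0, 1, 0, 1, 0, 1),
  Subject  = factor(c("a", "a", "b", "b", "c", "c"))
)

# Standard formula processing expands the factor into indicator columns
x <- model.matrix(Reaction ~ Days + Subject, data = toy)

colnames(x)
# "(Intercept)" "Days" "Subjectb" "Subjectc"

# The original column name is gone, so a later formula that refers to
# `Subject` has nothing to find
"Subject" %in% colnames(x)
# FALSE
```

Any second formula that mentions Subject, such as the one given to add_model(), is then evaluated against columns like Subjectb and Subjectc only.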
Current solution
After searching a lot more, there are two options that are kludgy but work:
bp <- hardhat::default_formula_blueprint(indicators = FALSE)
wflow_2 <-
  workflow() %>%
  add_formula(Reaction ~ ., blueprint = bp) %>%
  add_model(mod, formula = Reaction ~ Days + (Days || Subject))

wflow_3 <-
  workflow() %>%
  add_recipe(recipe(Reaction ~ ., data = sleepstudy)) %>%
  add_model(mod, formula = Reaction ~ Days + (Days || Subject))
We can make this interface a lot better and more intuitive.
Proposals
Some straw-man proposals:
First, let's make a function where users can tell the model what data to use, and maybe their limited roles, without doing any pre-processing:
wflow_4 <-
  workflow() %>%
  # Add in the data by processing through only `model.frame()` or equivalent.
  # No other in-line functions used; just as-is:
  add_variables_asis(Reaction ~ .) %>%
  add_model(mod, formula = Reaction ~ Days + (Days || Subject))
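As a rough illustration of the "process through only model.frame()" idea, the base R sketch below shows that model.frame() selects the columns but leaves them untouched. The toy data frame is made up for illustration, and add_variables_asis() itself remains a proposal, so this only shows the underlying base R behavior:

```r
# Hypothetical toy data standing in for sleepstudy (values are made up)
toy <- data.frame(
  Reaction = c(250, 260, 300, 310, 280, 290),
  Days     = c(0, 1, 0, 1, 0, 1),
  Subject  = factor(c("a", "a", "b", "b", "c", "c"))
)

# model.frame() picks out the variables named by the formula but does not
# create dummy variables or any other encoding
mf <- model.frame(Reaction ~ ., data = toy)

names(mf)
# "Reaction" "Days" "Subject"

# Subject is still a single factor column, so a model formula such as
# Reaction ~ Days + (Days || Subject) can still refer to it
is.factor(mf$Subject)
# TRUE
```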
Having two formulas might be confusing. Basic tidyselect tools could be used instead:
wflow_5 <-
  workflow() %>%
  # If formulas are confusing, we could use tidyselect functions
  add_variables(one_of(Reaction, Days, Subject)) %>%
  add_model(mod, formula = Reaction ~ Days + (Days || Subject))
Even though the same result can be achieved with current code, the existing methods are not intuitive and are not well documented in workflows.
Second, even though the model formula is tied to the model, it might be better to have a separate add function that attaches a model formula to a model specification:
wflow_6 <-
  workflow() %>%
  add_variables(one_of(Reaction, Days, Subject)) %>%
  add_model(mod) %>%
  add_model_formula(Reaction ~ Days + (Days || Subject))
A few people might want to add input: @jaredlander, @beckmart, @monicathieu, @billdenney, @emitanaka, and @Athanasiamo