add variables and special model formulas #34

@topepo

With upcoming hierarchical models, GAMs, and others, we need to make the workflow interface smoother.

Currently, it is not intuitive in a few ways:

  • There are two formulas: the usual preprocessing formula and the model formula (which you set via add_model()).
  • The default formula processing can eliminate the specific variables that you want to keep intact.

Historically, the model formula has always done several things at once: specify the variables in the model, create encodings for them, and then hand them off to the model with the appropriate analysis roles (e.g., outcome or predictor).
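To make those three jobs concrete, here is a minimal base R sketch. The `toy` data frame is made up for illustration (it only mimics the shape of `sleepstudy`), and `model.frame()` / `model.matrix()` stand in for the formula processing; this is not the workflows implementation itself.

```r
# Hypothetical toy data shaped like lme4's sleepstudy (values are made up).
toy <- data.frame(
  Reaction = c(250, 259, 251, 222, 205, 203),
  Days     = c(0, 1, 2, 0, 1, 2),
  Subject  = factor(rep(c("308", "309"), each = 3))
)

f <- Reaction ~ Days + Subject

# 1. Specify the variables: model.frame() keeps the columns as-is,
#    so Subject is still a factor here.
mf <- model.frame(f, data = toy)

# 2. Create encodings: model.matrix() expands the factor into
#    indicator (dummy) columns.
mm <- model.matrix(f, data = mf)

# 3. Assign roles: the left-hand side is the outcome,
#    everything in the matrix is a predictor.
y <- model.response(mf)

colnames(mf)  # "Reaction" "Days" "Subject"
colnames(mm)  # "(Intercept)" "Days" "Subject309"
```

Step 2 is the one that gets in the way for models whose own formula needs to see `Subject` as a single grouping column.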

Example

For example, if there were a parsnip hierarchical model to fit via stan or lme4, a user's first attempt might be:

library(tidymodels)

data(sleepstudy, package = "lme4")

mod <- linear_reg() %>% set_engine("stan glmer")

wflow_0 <- 
  workflow() %>% 
  # Won't work since the basic formula method makes dummy variables
  add_formula(Reaction ~ Days + (Days || Subject)) %>% 
  add_model(mod)

fit() will generate the error:

Error in Days || Subject : invalid 'y' type in 'x || y'

(The error message could be more helpful.)
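The failure comes from R evaluating `Days || Subject` as an ordinary logical expression rather than as random-effects syntax. A small base R reproduction, assuming a made-up `toy` data frame and using `model.frame()` as a stand-in for the default formula processing. The exact message depends on the R version, so the sketch only checks that an error is raised:

```r
# Hypothetical toy data shaped like lme4's sleepstudy (values are made up).
toy <- data.frame(
  Reaction = c(250, 259, 251, 222, 205, 203),
  Days     = c(0, 1, 2, 0, 1, 2),
  Subject  = factor(rep(c("308", "309"), each = 3))
)

# model.frame() evaluates every term on the right-hand side, including
# `Days || Subject`, which is not a valid logical operation on these columns.
res <- tryCatch(
  model.frame(Reaction ~ Days + (Days || Subject), data = toy),
  error = function(e) e
)

inherits(res, "error")  # the special term cannot be evaluated as plain R
conditionMessage(res)   # wording varies across R versions
```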

After looking around, the user finds the formula argument to add_model():

wflow_1 <- 
  workflow() %>% 
  # Make a simple formula for processing the data 
  add_formula(Reaction ~ Days + Subject) %>% 
  # Then add another formula to give to the model: 
  add_model(mod, formula = Reaction ~ Days + (Days || Subject))

That ends in the error:

Error in eval(predvars, data, env) : object 'Subject' not found 

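because add_formula() makes dummy variables: Subject is replaced by indicator columns, so the original Subject column no longer exists when the model formula is evaluated.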
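A base R sketch of why `Subject` goes missing, assuming a made-up `toy` data frame and with `model.matrix()` standing in for the dummy-variable step of `add_formula()`. The second formula uses a plain `Subject` term so that only the column lookup failure shows up:

```r
# Hypothetical toy data shaped like lme4's sleepstudy (values are made up).
toy <- data.frame(
  Reaction = c(250, 259, 251, 222, 205, 203),
  Days     = c(0, 1, 2, 0, 1, 2),
  Subject  = factor(rep(c("308", "309"), each = 3))
)

# Step 1: the preprocessing formula expands Subject into indicator columns.
processed <- as.data.frame(model.matrix(~ Days + Subject, data = toy))
processed$Reaction <- toy$Reaction
names(processed)  # contains "Subject309" but no "Subject"

# Step 2: a model formula that names Subject now has nothing to find.
res <- tryCatch(
  model.frame(Reaction ~ Days + Subject, data = processed),
  error = function(e) conditionMessage(e)
)
res
```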

Current solution

After a lot more searching, you can find two options that are kludgy but work:

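# Keep factors as-is instead of creating dummy variables
# (older hardhat releases spelled this indicators = FALSE)
bp <- hardhat::default_formula_blueprint(indicators = "none")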
wflow_2 <- 
  workflow() %>% 
  add_formula(Reaction ~ ., blueprint = bp) %>% 
  add_model(mod, formula = Reaction ~ Days + (Days || Subject))

wflow_3 <- 
  workflow() %>% 
  add_recipe(recipe(Reaction ~ ., data = sleepstudy)) %>% 
  add_model(mod, formula = Reaction ~ Days + (Days || Subject))

We can make this interface a lot better and more intuitive.

Proposals

Some straw-man proposals:

First, let's make a function where users can tell the model what data to use, and maybe assign the variables a limited set of roles, without doing any preprocessing:

wflow_4 <- 
  workflow() %>% 
  # Add in the data by processing through only `model.frame()` or equivalent. 
  # No other in-line functions used; just as-is:
  add_variables_asis(Reaction ~ .) %>% 
  add_model(mod, formula = Reaction ~ Days + (Days || Subject))

Having two formulas might be confusing. Basic tidyselect tools could be used instead:

wflow_5 <- 
  workflow() %>% 
  # If formulas are confusing, we could use tidyselect functions
  add_variables(one_of(Reaction, Days, Subject)) %>% 
  add_model(mod, formula = Reaction ~ Days + (Days || Subject))

Even though the same result can be achieved with the current code, the existing methods are not intuitive and are not well documented in workflows.

Second, even though the model formula is tied to the model, it might be better to have a separate add function that attaches a model formula to a model specification:

wflow_6 <- 
  workflow() %>% 
  add_variables(one_of(Reaction, Days, Subject)) %>% 
  add_model(mod) %>% 
  add_model_formula(Reaction ~ Days + (Days || Subject))

A few people might want to add input: @jaredlander, @beckmart, @monicathieu, @billdenney, @emitanaka, and @Athanasiamo

Labels: feature (a feature request or enhancement), wip (work in progress)