Description
As described in this post, there is an extreme amount of heterogeneity in how models are specified, especially as it relates to the R model formula:
- The traditional R formula is used in 95% of cases: the formula specifies the variable roles and encodes categorical data with dummy variables (e.g. lm()).
- The slightly-less-traditional R formula happens with tree-based models (and others), where roles are declared but no dummy variables are created. ranger() is the best example.
- No formula: the underlying model function has no formula method, and our code uses a formula to make the appropriate x/y data structures (glmnet() is the poster child for this).
- Non-standard formulas don't use the standard R infrastructure and usually specify more specific roles for predictors (GAMs and multilevel models are the best examples). The encoding requirement depends on the model type. Such packages mostly have their own specialized formula parsers.
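To make the difference between the first two cases concrete, here is a minimal base R sketch. The data frame is invented for illustration; only model.matrix() and model.frame() from base R are used, and the snippet does not reflect how any particular modeling package processes its formula internally.

```r
# Toy data (made up): one numeric predictor and one three-level factor
dat <- data.frame(
  y = c(1.2, 2.3, 3.1, 4.8, 5.0, 6.7),
  x = c(0.5, 1.0, 1.5, 2.0, 2.5, 3.0),
  g = factor(c("a", "b", "c", "a", "b", "c"))
)

# Traditional handling: the factor is expanded into indicator columns
mm <- model.matrix(y ~ x + g, data = dat)
colnames(mm)
#> "(Intercept)" "x" "gb" "gc"

# Roles only: the factor stays a single column, as tree-based models prefer
mf <- model.frame(y ~ x + g, data = dat)
names(mf)
#> "y" "x" "g"
```

The same formula therefore yields two different predictor sets depending on which convention the model follows.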
Because the tidymodels packages try to provide a uniform model interface, this heterogeneity causes two major problems:
- A user expects that a formula given to parsnip or workflows behaves the same way as it would with the underlying function (example: "Issues with behind-the-scenes, surprising variable pre-processing and ranger package for random forests", tune#151).
- For nonstandard formulas, there should be an easy way to use them in our infrastructure (see "add variables and special model formulas", workflows#34).
Potential solution
When engine details are specified, we could add a descriptor for each model/engine that tells us how to handle the encoding issue. For example, the ranger engine could have a flag that dummy variables should not be created.
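One way such a descriptor could look is sketched below in base R. The names encoding_info and make_predictors() are hypothetical and exist only for this illustration; they are not part of parsnip, and the real implementation may store and use this information quite differently.

```r
# Hypothetical per-engine descriptor (not a parsnip object):
# does the engine need dummy variables made from the formula?
encoding_info <- list(
  glmnet = list(dummy_variables = TRUE),
  ranger = list(dummy_variables = FALSE)
)

# Hypothetical helper: build the predictor set for an x/y interface
make_predictors <- function(formula, data, engine) {
  info <- encoding_info[[engine]]
  if (is.null(info)) {
    stop("No encoding information for engine: ", engine)
  }
  if (info$dummy_variables) {
    # Expand factors to indicators and drop the intercept column
    x <- model.matrix(formula, data = data)
    x[, colnames(x) != "(Intercept)", drop = FALSE]
  } else {
    # Keep factors intact; the first model frame column is the outcome
    mf <- model.frame(formula, data = data)
    mf[, -1, drop = FALSE]
  }
}

# Toy data (made up for illustration)
dat <- data.frame(
  y = c(1.2, 2.3, 3.1, 4.8, 5.0, 6.7),
  x = c(0.5, 1.0, 1.5, 2.0, 2.5, 3.0),
  g = factor(c("a", "b", "c", "a", "b", "c"))
)

x_glmnet <- make_predictors(y ~ x + g, dat, "glmnet")  # numeric matrix
x_ranger <- make_predictors(y ~ x + g, dat, "ranger")  # data frame with a factor
```

Keeping the flag next to the engine definition means the formula-handling code never needs to special-case individual models.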
For parsnip, this would matter when the user calls fit(model, formula, data) and the underlying function uses the x/y interface. Apart from that, it should not affect the package.
For workflows, this would tell the underlying hardhat code what to do for each model. We might be able to give add_formula() the ability to use nonstandard formulas. In that case, it could capture the variables used in the formula and treat them appropriately to make the data.
For example:
wflow <-
  workflow() %>%
  add_formula(Reaction ~ Days + (Days | Subject)) %>%
  add_model(lmer_mod)
This would determine which variables to use on each side of the ~ and use all.vars() to get the appropriate columns (it's actually not that simple, but there are functions in recipes already).
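As a rough illustration of the all.vars() idea, the base R sketch below pulls the variable names from each side of the multilevel formula and uses them to select columns. The data frame is invented for the example, and, as noted above, a real implementation would need more care (for instance with inline functions or the dot shorthand).

```r
f <- Reaction ~ Days + (Days | Subject)

# For a two-sided formula, element 2 is the left side and element 3 the right
outcome_vars   <- all.vars(f[[2]])
predictor_vars <- all.vars(f[[3]])

outcome_vars
#> "Reaction"
predictor_vars
#> "Days" "Subject"

# Toy data (made up) with one column the formula does not use
dat <- data.frame(
  Reaction = c(250, 260, 251, 321),
  Days     = c(0, 1, 0, 1),
  Subject  = factor(c("s1", "s1", "s2", "s2")),
  unused   = c(1, 2, 3, 4)
)

# Keep only the columns that the formula refers to
needed <- dat[, all.vars(f), drop = FALSE]
names(needed)
#> "Reaction" "Days" "Subject"
```

Note that all.vars() ignores the special | syntax entirely, which is what makes it usable on formulas that the standard model.frame() machinery would not interpret as intended.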
Based on the information in the model specification, the appropriate hardhat blueprint could be used.
(Aside: this would not eliminate the need for the extra formula slot of add_model(), since it would be used with a recipe.)