Adding an engine specification field for predictor encodings

As described in [this post](https://rviews.rstudio.com/2017/03/01/the-r-formula-method-the-bad-parts/), there is an extreme amount of heterogeneity in how models are specified, especially as it relates to the R model formula: 

 1. **traditional R formula** is used in 95% of cases, where the formula specifies the variable roles and encodes categorical data with dummy variables (e.g. `lm()`).

 1. **slightly-less-traditional R formula** happens with tree-based models (and others) where roles are declared but no dummy variables are created. `ranger()` is the best example. 

 1. **no formulas** occur when the underlying model function has no formula method and our code uses a formula to make the appropriate x/y data structures (`glmnet()` is the poster child for this).

 1. **non-standard formulas** don't use the standard R infrastructure and usually specify more specific roles for predictors ([GAMs](https://rdrr.io/cran/mgcv/man/gam.models.html) and [multilevel models](https://rdrr.io/cran/lme4/man/lmer.html) are the best examples). The encoding requirement depends on the model type. Such packages mostly have their own specialized formula parsers.
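The difference between the first two cases can be seen with base R alone. This is a minimal sketch on a made-up three-row data frame: `model.matrix()` expands a factor into dummy columns, as the traditional formula method does, while `model.frame()` only resolves the roles and leaves the factor intact, which is closer to what a tree-based engine wants.

```r
# Toy data, invented for illustration only
dat <- data.frame(
  y = c(1.2, 3.4, 2.2),
  x = c(0.5, 1.5, 2.5),
  g = factor(c("a", "b", "c"))
)

# Traditional encoding: the factor becomes treatment-contrast dummy columns
mm <- model.matrix(y ~ x + g, data = dat)
colnames(mm)
# "(Intercept)" "x" "gb" "gc"

# Roles only: the factor column is kept as a factor
mf <- model.frame(y ~ x + g, data = dat)
sapply(mf, class)
```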

Because the tidymodels packages try to provide a uniform model interface, this heterogeneity causes two major problems:

* A user expects a formula given to `parsnip` or `workflows` to behave the same way as it would with the underlying function. (example: tidymodels/tune#151)

* For nonstandard formulas, there should be a way to easily use them in our infrastructure. (see tidymodels/workflows#34)

## Potential solution

[When engine details are specified](https://www.tidymodels.org/learn/develop/models/) we could add a descriptor for each model/engine that tells us how to handle the encoding issue. For example, the `ranger` engine could have a flag that dummy variables should not be created. 
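One way to picture such a descriptor is a small lookup table keyed by engine. The sketch below uses base R only. `engine_encodings`, its `indicators` field, and `encode_predictors()` are hypothetical names invented for illustration; they are not part of `parsnip`, and the real design could look quite different.

```r
# Hypothetical per-engine descriptors (not a parsnip API)
engine_encodings <- list(
  lm     = list(indicators = TRUE),   # wants dummy variables
  ranger = list(indicators = FALSE)   # wants factors left alone
)

# Hypothetical helper: build the predictors according to the engine's flag
encode_predictors <- function(formula, data, engine) {
  spec <- engine_encodings[[engine]]
  if (is.null(spec)) {
    stop("No encoding registered for engine '", engine, "'", call. = FALSE)
  }
  if (spec$indicators) {
    x <- model.matrix(formula, data = data)
    # Drop the intercept column; the engine is assumed to add its own
    as.data.frame(x[, colnames(x) != "(Intercept)", drop = FALSE])
  } else {
    mf <- model.frame(formula, data = data)
    # Remove the outcome (first column), keeping predictors in their original types
    mf[, -1, drop = FALSE]
  }
}

names(encode_predictors(Sepal.Length ~ Species + Sepal.Width, iris, "lm"))
names(encode_predictors(Sepal.Length ~ Species + Sepal.Width, iris, "ranger"))
```

With the same formula and data, the first call produces dummy columns for `Species` and the second keeps `Species` as a single factor column.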

For `parsnip`, this would affect when the user uses `fit(model, formula, data)` and the underlying function uses the x/y interface. Apart from that, it should not impact the package. 

For `workflows`, this would tell the underlying `hardhat` code what to do for each model. We might be able to give `add_formula()` the ability to use nonstandard formulas. In that case, it could capture the variables used in the formula and treat them appropriately to make the data.

For example: 

```r
wflow <- 
  workflow() %>% 
  add_formula(Reaction ~ Days + (Days | Subject)) %>% 
  add_model(lmer_mod)
```

would determine which variables to use on each side of the `~` and use `all.vars()` to get the appropriate columns (it's actually not that simple, but there are functions in `recipes` already).
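As a rough base R illustration of that idea (and, as noted, not the whole story), `all.vars()` can be applied to each side of the formula. The data frame here is made up, with column names borrowed from the formula above plus one extra column.

```r
f <- Reaction ~ Days + (Days | Subject)

# A two-sided formula is a call: f[[2]] is the left side, f[[3]] the right
outcomes   <- all.vars(f[[2]])
predictors <- all.vars(f[[3]])
outcomes    # "Reaction"
predictors  # "Days" "Subject"

# Toy data with one column that the formula never mentions
dat <- data.frame(
  Reaction = c(250, 259, 251, 321),
  Days     = c(0, 1, 0, 1),
  Subject  = factor(c("308", "308", "309", "309")),
  unused   = letters[1:4]
)

# Keep only the columns the formula refers to, with no dummy expansion
needed <- dat[, c(outcomes, predictors), drop = FALSE]
names(needed)
```

Because the formula is only inspected and never evaluated, the `lme4`-style `(Days | Subject)` term causes no trouble here; `Subject` stays a factor for the model's own parser to handle.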

Based on the information in the model specification, the appropriate `hardhat` blueprint could be used. 

(aside: this would not eliminate the need for the extra formula slot of `add_model()`, since it would be used with a recipe)

