Adding an engine specification field for predictor encodings #290

Closed
@topepo

As described in this post, there is an extreme amount of heterogeneity in how models are specified, especially as it relates to the R model formula:

  1. traditional R formula is used in 95% of cases where the formula specifies the variable roles and encodes categorical data with dummy variables (e.g. lm()).

  2. slightly-less-traditional R formula happens with tree-based models (and others) where roles are declared but no dummy variables are created. ranger() is the best example.

  3. no formulas when the underlying model function has no formula method and our code uses a formula to make the appropriate x/y data structures (glmnet() is the poster child for this)

  4. non-standard formulas don't use the standard R infrastructure and usually specify more specific roles for predictors (GAMs and multilevel models are the best examples). The encoding requirement depends on the model type. Such packages mostly have their own specialized formula parsers.
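
To make the difference between the first two cases concrete, here is a minimal base R sketch. The toy data frame `dat` is made up for illustration; `model.matrix()` and `model.frame()` are the standard R tools, not anything specific to parsnip.

```r
# Toy data (illustrative only): one numeric and one three-level factor predictor.
dat <- data.frame(
  y = c(1.2, 3.4, 2.2, 5.1, 4.0, 3.3),
  x = c(0.5, 1.5, 1.0, 2.5, 2.0, 1.2),
  g = factor(c("a", "b", "c", "a", "b", "c"))
)

# Case 1: the traditional machinery expands `g` into indicator columns.
mm <- model.matrix(y ~ x + g, data = dat)
colnames(mm)
#> [1] "(Intercept)" "x"           "gb"          "gc"

# Case 2: the formula only declares roles; `g` stays a single factor column.
mf <- model.frame(y ~ x + g, data = dat)
names(mf)
#> [1] "y" "x" "g"
```

A tree-based engine would typically want something like the second structure, while `lm()`-style fitting works from the first.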

Because the tidymodels packages try to provide a uniform model interface, this heterogeneity causes two major problems for them:

Potential solution

When engine details are specified we could add a descriptor for each model/engine that tells us how to handle the encoding issue. For example, the ranger engine could have a flag that dummy variables should not be created.
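
As a rough sketch of what such a descriptor could look like, the code below keeps one row per model/engine pair and consults it before building the predictors. Everything here is hypothetical: `encoding_info`, `needs_dummies()`, and `prepare_predictors()` are made-up names for illustration and are not part of parsnip.

```r
# Hypothetical registry: one row per model/engine pair with an encoding flag.
encoding_info <- data.frame(
  model = c("linear_reg", "rand_forest"),
  engine = c("lm", "ranger"),
  make_dummies = c(TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Hypothetical helper: look up the flag for a given model/engine pair.
needs_dummies <- function(model, engine) {
  row <- encoding_info[
    encoding_info$model == model & encoding_info$engine == engine, ,
    drop = FALSE
  ]
  if (nrow(row) != 1) {
    stop("No unique encoding entry for ", model, "/", engine)
  }
  row$make_dummies
}

# Hypothetical helper: build predictors according to the flag.
prepare_predictors <- function(formula, data, model, engine) {
  if (needs_dummies(model, engine)) {
    # Expand factors into indicator columns and drop the intercept column.
    mm <- model.matrix(formula, data = data)
    mm[, colnames(mm) != "(Intercept)", drop = FALSE]
  } else {
    # Keep factors as-is; the first column of the model frame is the outcome.
    mf <- model.frame(formula, data = data)
    mf[, -1, drop = FALSE]
  }
}

f <- Sepal.Length ~ Species + Sepal.Width
x_lm <- prepare_predictors(f, iris, "linear_reg", "lm")
x_ranger <- prepare_predictors(f, iris, "rand_forest", "ranger")
```

With the built-in `iris` data, `x_lm` is a numeric matrix with indicator columns for `Species`, while `x_ranger` is a data frame in which `Species` remains a factor.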

For parsnip, this would affect when the user uses fit(model, formula, data) and the underlying function uses the x/y interface. Apart from that, it should not impact the package.

For workflows, this would tell the underlying hardhat code what to do for each model. We might be able to give add_formula() the ability to use nonstandard formulas. In that case, it could capture the variables used in the formula and treat them appropriately to make the data.

For example:

wflow <- 
  workflow() %>% 
  add_formula(Reaction ~ Days + (Days | Subject)) %>% 
  add_model(lmer_mod)

This would determine which variables to use on each side of the ~ and use all.vars() to get the appropriate columns (it's actually not that simple, but there are functions in recipes already).
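
A minimal base R sketch of the all.vars() idea, using the formula from the example above. The data frame `toy` is made up for illustration, and this deliberately ignores the complications mentioned in the parenthetical.

```r
f <- Reaction ~ Days + (Days | Subject)

# All symbols referenced anywhere in the formula, de-duplicated.
all.vars(f)
#> [1] "Reaction" "Days"     "Subject"

# Split by side of the `~`: f[[2]] is the left-hand side, f[[3]] the right.
outcomes <- all.vars(f[[2]])
predictors <- all.vars(f[[3]])

# Toy data (illustrative only) with one extra column that the formula ignores.
toy <- data.frame(
  Reaction = c(250, 260, 255, 270),
  Days = c(0, 1, 0, 1),
  Subject = factor(c("s1", "s1", "s2", "s2")),
  unused = 1:4
)

# Keep only the columns the formula refers to.
needed <- toy[, c(outcomes, predictors), drop = FALSE]
names(needed)
#> [1] "Reaction" "Days"     "Subject"
```

One reason it is "not that simple": `all.vars(y ~ .)` returns `"y"` and `"."`, so the dot would still have to be expanded against the data.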

Based on the information in the model specification, the appropriate hardhat blueprint could be used.

(aside: this would not eliminate the need for the extra formula slot of add_model(), since it would be used with a recipe)

Labels: feature (a feature request or enhancement)
