
Adding an engine specification field for predictor encodings #290

Closed
topepo opened this issue Apr 28, 2020 · 3 comments · Fixed by #319
Labels
feature a feature request or enhancement

Comments

@topepo
Member

topepo commented Apr 28, 2020

As described in this post, there is an extreme amount of heterogeneity in how models are specified, especially as it relates to the R model formula:

  1. traditional R formula is used in 95% of cases where the formula specifies the variable roles and encodes categorical data with dummy variables (e.g. lm()).

  2. slightly-less-traditional R formula happens with tree-based models (and others) where roles are declared but no dummy variables are created. ranger() is the best example.

  3. no formulas when the underlying model function has no formula method and our code uses a formula to make the appropriate x/y data structures (glmnet() is the poster child for this)

  4. non-standard formulas don't use the standard R infrastructure and usually specify more specific roles for predictors (GAMs and multilevel models are the best examples). The encoding requirement depends on the model type. Such packages tend to have their own specialized formula parsers.
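
To make the difference between the first two cases concrete, here is a minimal base R sketch. It uses the built-in iris data and only model.matrix() and model.frame(); it does not call lm() or ranger() themselves, so treat it as an illustration of the two encodings rather than of what either package does internally.

```r
# Species is a factor with three levels in the built-in iris data.
f <- Sepal.Length ~ Sepal.Width + Species

# Case 1 (lm()-style): the formula machinery expands the factor
# into dummy (indicator) columns.
x_dummy <- model.matrix(f, data = iris)
colnames(x_dummy)

# Case 2 (ranger()-style): the formula only declares roles;
# Species is passed along as a factor, with no dummy variables.
mf <- model.frame(f, data = iris)
sapply(mf, class)
```

In the first case the design matrix has separate columns for the non-reference Species levels; in the second, Species remains a single factor column.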

Because the tidymodels packages try to provide a uniform model interface, this heterogeneity creates two major problems for them:

Potential solution

When engine details are specified we could add a descriptor for each model/engine that tells us how to handle the encoding issue. For example, the ranger engine could have a flag that dummy variables should not be created.
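
A rough sketch of what such a descriptor might look like, in base R only. The lookup table engine_encodings and the helper make_x() are hypothetical names invented for illustration, and the flag name predictor_indicators is an assumption; none of this is parsnip's actual API.

```r
# Hypothetical per-engine descriptor: should dummy variables be created?
engine_encodings <- list(
  lm     = list(predictor_indicators = TRUE),
  ranger = list(predictor_indicators = FALSE)
)

# Hypothetical helper: turn a formula into predictors for an x/y interface,
# consulting the descriptor for the requested engine.
make_x <- function(formula, data, engine) {
  enc <- engine_encodings[[engine]]
  if (is.null(enc)) {
    stop("No encoding registered for engine '", engine, "'")
  }
  if (enc$predictor_indicators) {
    # Expand factors into indicator columns and drop the intercept column.
    x <- model.matrix(formula, data = data)
    x[, colnames(x) != "(Intercept)", drop = FALSE]
  } else {
    # Keep factors as-is; the first column of the model frame is the outcome.
    mf <- model.frame(formula, data = data)
    mf[, -1, drop = FALSE]
  }
}

x_lm     <- make_x(Sepal.Length ~ Sepal.Width + Species, iris, "lm")
x_ranger <- make_x(Sepal.Length ~ Sepal.Width + Species, iris, "ranger")
```

Here x_lm is a numeric matrix with indicator columns for Species, while x_ranger is a data frame in which Species is still a factor.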

For parsnip, this would affect when the user uses fit(model, formula, data) and the underlying function uses the x/y interface. Apart from that, it should not impact the package.

For workflows, this would tell the underlying hardhat code what to do for each model. We might be able to give add_formula() the ability to use nonstandard formulas. In that case, it could capture the variables used in the formula and treat them appropriately to make the data.

For example:

wflow <- 
  workflow() %>% 
  add_formula(Reaction ~ Days + (Days | Subject)) %>% 
  add_model(lmer_mod)

would determine which variables to use on each side of the ~ and use all.vars() to get the appropriate columns (it's actually not that simple but there are functions in recipes already).
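
As a simplified illustration of the all.vars() idea (base R only, and ignoring the complications mentioned above), the two sides of the formula can be pulled apart like this. The object names lhs_vars and rhs_vars are just for the example.

```r
f <- Reaction ~ Days + (Days | Subject)

# A two-sided formula is a call to `~`: element 2 is the left-hand side,
# element 3 is the right-hand side.
lhs_vars <- all.vars(f[[2]])
rhs_vars <- all.vars(f[[3]])

lhs_vars
#> [1] "Reaction"
rhs_vars
#> [1] "Days"    "Subject"
```

The special | operator is never evaluated here; all.vars() only walks the expression and collects the unique variable names.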

Based on the information in the model specification, the appropriate hardhat blueprint could be used.

(aside: this would not eliminate the need for the extra formula slot of add_model() since it would be used with a recipe)

@juliasilge
Member

I'm working through these and I don't have a working Spark installation set up. Do we know if predictor_indicators = TRUE for Spark?

@topepo
Member Author

topepo commented May 19, 2020

It would appear that they can be left as-is:

library(sparklyr)

sc <- spark_connect(master = "local")
iris <- copy_to(sc, iris, "iris")
ml_linear_regression(iris, Sepal_Length ~ .)
#> Formula: Sepal_Length ~ .
#> 
#> Coefficients:
#>        (Intercept)        Sepal_Width       Petal_Length        Petal_Width 
#>          2.1712663          0.4958889          0.8292439         -0.3151552 
#> Species_versicolor  Species_virginica 
#>         -0.7235620         -1.0234978

ml_decision_tree(iris, Sepal_Length ~ .)
#> Formula: Sepal_Length ~ .
#> 
#> DecisionTreeRegressionModel (uid=decision_tree_95324c8af457) of depth 5 with 55 nodes

Created on 2020-05-18 by the reprex package (v0.3.0)

@github-actions

github-actions bot commented Mar 7, 2021

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Mar 7, 2021