Adding an engine specification field for predictor encodings #290
Labels
feature
a feature request or enhancement
Comments
I'm working through these and I don't have a working Spark installation set up. Do we know if |
It would appear that they can be left as-is:

library(sparklyr)
sc <- spark_connect(master="local")
iris <- copy_to(sc, iris, "iris")
ml_linear_regression(iris, Sepal_Length ~ .)
#> Formula: Sepal_Length ~ .
#>
#> Coefficients:
#>        (Intercept)        Sepal_Width       Petal_Length        Petal_Width
#>          2.1712663          0.4958889          0.8292439         -0.3151552
#> Species_versicolor  Species_virginica
#>         -0.7235620         -1.0234978
ml_decision_tree(iris, Sepal_Length ~ .)
#> Formula: Sepal_Length ~ .
#>
#> DecisionTreeRegressionModel (uid=decision_tree_95324c8af457) of depth 5 with 55 nodes

Created on 2020-05-18 by the reprex package (v0.3.0)
As described in this post, there is an extreme amount of heterogeneity in how models are specified, especially as it relates to the R model formula:

- traditional R formula is used in 95% of cases, where the formula specifies the variable roles and encodes categorical data with dummy variables (e.g. `lm()`).
- slightly-less-traditional R formula happens with tree-based models (and others), where roles are declared but no dummy variables are created. `ranger()` is the best example.
- no formulas, when the underlying model function has no formula method and our code uses a formula to make the appropriate x/y data structures (`glmnet()` is the poster child for this).
- non-standard formulas don't use the standard R infrastructure and usually specify more specific roles for predictors (GAMs and multilevel models are the best examples). The encoding requirement depends on the model type. Such packages mostly have their own specialized formula parsers.
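To make the first two cases concrete, here is a small base-R sketch (not parsnip code) that contrasts the two encodings on the built-in `iris` data. `model.matrix()` expands the `Species` factor into indicator columns, as `lm()` does internally, while `model.frame()` only resolves the roles and leaves the factor intact, which is closer to what a tree-based engine such as `ranger()` expects.

```r
# Traditional encoding: the formula machinery expands factors into
# treatment-contrast dummy variables (what lm() does internally).
x_dummy <- model.matrix(Sepal.Length ~ ., data = iris)
colnames(x_dummy)

# Roles only: the formula selects the outcome and the predictors,
# but the factor column is kept as a single factor.
mf <- model.frame(Sepal.Length ~ ., data = iris)
x_factor <- mf[, -1, drop = FALSE]  # drop the outcome column
sapply(x_factor, class)
```

The same formula therefore yields six numeric columns (including the intercept) in one case and four predictor columns, one of them a factor, in the other.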
Since they try to provide a uniform model interface, there are two major problems for the tidymodels packages because of this:

- A user expects that a formula given to `parsnip` or `workflows` behaves the same way as it would have with the underlying function. (example: "Issues with behind-the-scenes, surprising variable pre-processing and ranger package for random forests", tune#151)
- For nonstandard formulas, there should be a way to easily use them in our infrastructure. (see "add variables and special model formulas", workflows#34)
Potential solution
When engine details are specified we could add a descriptor for each model/engine that tells us how to handle the encoding issue. For example, the `ranger` engine could have a flag that dummy variables should not be created.

For `parsnip`, this would affect when the user uses `fit(model, formula, data)` and the underlying function uses the x/y interface. Apart from that, it should not impact the package.

For `workflows`, this would tell the underlying `hardhat` code what to do for what model. We might be able to give `add_formula()` the ability to use nonstandard formulas. In that case, it could capture the variables used in the formula and treat them appropriately to make the data.

For example:
would determine which variables to use on each side of the `~` and use `all.vars()` to get the appropriate columns (it's actually not that simple but there are functions in `recipes` already).

Based on the information in the model specification, the appropriate `hardhat` blueprint could be used.

(aside: this would not eliminate the need for the extra formula slot of `add_model()` since it would be used with a recipe)
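A rough sketch of how the pieces above might fit together, in base R only. Everything named here (`engine_encoding`, `formula_vars()`, `make_predictors()`) is hypothetical and is not part of parsnip, workflows, or hardhat. The point is only that a per-engine flag can decide whether indicator columns are created, and that `all.vars()` can recover the columns on each side of the `~`. Real formulas are less simple than this, as noted above.

```r
# Hypothetical per-engine descriptor: should the formula interface
# create dummy variables for factor predictors?
engine_encoding <- list(
  lm     = list(dummy_variables = TRUE),
  ranger = list(dummy_variables = FALSE)
)

# Collect the variable names on each side of `~` with all.vars().
# Assumes a two-sided formula; a "." on the right-hand side is
# expanded against the column names of `data`.
formula_vars <- function(f, data) {
  outcomes   <- all.vars(f[[2]])
  predictors <- all.vars(f[[3]])
  if ("." %in% predictors) {
    predictors <- union(
      setdiff(predictors, "."),
      setdiff(names(data), outcomes)
    )
  }
  list(outcomes = outcomes, predictors = predictors)
}

# Build x/y structures, consulting the (hypothetical) engine flag.
make_predictors <- function(f, data, engine) {
  vars <- formula_vars(f, data)
  x <- data[, vars$predictors, drop = FALSE]
  if (isTRUE(engine_encoding[[engine]]$dummy_variables)) {
    # Expand factors into indicator columns, then drop the intercept.
    x <- model.matrix(~ ., data = x)[, -1, drop = FALSE]
  }
  list(x = x, y = data[[vars$outcomes[1]]])
}

x_lm     <- make_predictors(Sepal.Length ~ ., iris, engine = "lm")$x
x_ranger <- make_predictors(Sepal.Length ~ ., iris, engine = "ranger")$x
colnames(x_lm)
colnames(x_ranger)
```

Because `all.vars()` ignores the operators in a formula, the same helper also returns `x` and `subject` as predictors for a multilevel-style formula such as `y ~ x + (1 | subject)`, without knowing anything about what `|` means to the underlying package.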