
standardize with refit without centering #164

Open
alaindanet opened this issue Apr 12, 2022 · 6 comments

@alaindanet

Thank you so much for the package!

I would like to know whether it would be possible to provide an option for standardization without centering when using the refit method.

The rationale is that the negative and positive values of some predictor variables can carry a meaningful interpretation (e.g. a difference in price over a period), and in that case it is valuable to only scale the variables and not center them, as suggested by Andrew Gelman (Gelman, 2008; cited in the documentation of the standardize function):

> 1. subtracting the mean of each input variable and dividing by its standard deviation. (Strictly speaking, subtracting the mean is not necessary, but this step allows main effects to be more easily interpreted in the presence of interactions.)

> We also center each input variable to have a mean of zero so that interactions are more interpretable. Again, in some applications it can make sense for variables to be centered around some particular baseline value, but we believe our automatic procedure is better than the current default of using whatever value happens to be zero on the scale of the data, which all too commonly results in absurdities such as age = 0 years or party identification = 0 on a 1–7 scale.

In the case where negative and positive values of a predictor variable have different meanings, I believe that centering can change the meaning of the regression coefficients.

I realized this in my own data analysis, where a positive coefficient became negative after centering, with the type of explanatory variable I mentioned above.
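
For illustration, here is a minimal sketch with simulated data (the names and values are made up, not my actual data) of what I mean: with an interaction in the model, the coefficient of x1 is its slope at x2 = 0 before centering, but its slope at x2 = mean(x2) after centering x2, so its sign can flip.

```r
# Simulated example: the slope of x1 is +1 at x2 = 0 and about -1 at x2 = mean(x2).
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n, mean = 2)                 # mean(x2) is far from 0
y  <- 1 * x1 + 0.5 * x2 - 1 * x1 * x2    # no noise, so lm() recovers these exactly

raw      <- lm(y ~ x1 * x2)
centered <- lm(y ~ x1 * scale(x2, scale = FALSE))  # center x2 only

coef(raw)["x1"]       # +1: slope of x1 at x2 = 0
coef(centered)["x1"]  # about -1: slope of x1 at x2 = mean(x2)
```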

@mattansb
Member

Do you have an example (reprex), without an interaction, where centering vs. non-centering changes the signs of the coefficients (other than the intercept)? It really shouldn't...

@strengejacke
Member

Maybe standardizing the data before fitting the model can help; you have options to control the reference used for centering and dispersion: https://easystats.github.io/datawizard/reference/standardize.html
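
For example, something along these lines (the variable names are just placeholders): dividing a predictor by its standard deviation without centering it keeps 0 meaningful (e.g. "no price change") while still putting the slope on a per-SD scale.

```r
# Placeholder data: scale a predictor by its SD without centering it,
# then fit the model on the pre-scaled data.
set.seed(123)
df <- data.frame(
  y            = rnorm(100),
  price_change = rnorm(100),
  group        = rnorm(100)
)

df$price_change_s <- df$price_change / sd(df$price_change)  # scaled, not centered

m <- lm(y ~ price_change_s * group, data = df)
summary(m)
```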

@alaindanet
Author

alaindanet commented Apr 12, 2022

@mattansb It is a two-way interaction that shows the change of sign; it happens when the predictors are log2 ratios over temporal data, i.e. log2(x1/x0). I do not have much time right now to reproduce this, but it took me a long time to figure out why the results changed.

@strengejacke I agree.

I was thinking about an option in the compare_models() function, which I was using a bit too automatically.

It is just that this discrepancy made me realize that I should really think about what I am doing when standardizing variables. As highlighted in my first post, centering may be misleading in some cases.

To elaborate a bit: when standardizing coefficients, we think about the formula

$$r_\delta = \beta \dfrac{\sigma_x}{\sigma_y}$$

Standardization with the "refit" option both centers and scales the variables, which is not so clear right now from the compare_models() function; I guess that some users (including me) are not thinking about the centering step.

It is not a big deal, but I wanted to raise this issue to see whether a sentence could be added to the documentation of compare_models(), or an option to specify whether variables should be centered or not, or at least a reference to Gelman regarding the difference between scaling only and centering plus scaling.
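
To make the difference concrete, here is a rough sketch with made-up data (not my real analysis); I am using effectsize::standardize_parameters() only because it exposes both approaches side by side. If I read the documentation correctly, method = "refit" re-fits the model on centered and scaled data, while method = "basic" just rescales the raw coefficients by the SD ratio from the formula above:

```r
# Made-up data with an interaction, just to illustrate the two methods.
library(effectsize)

set.seed(1)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100, mean = 2))
d$y <- d$x1 - d$x1 * d$x2 + rnorm(100)

m <- lm(y ~ x1 * x2, data = d)

standardize_parameters(m, method = "refit")  # re-fit on centered + scaled data
standardize_parameters(m, method = "basic")  # post-hoc: raw coefficients * sd(x) / sd(y)
```

With the interaction in the model, the two tables give different lower-order coefficients, which is the kind of discrepancy I ran into.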

That said, thank you again for your great package.

@mattansb
Member

Indeed, if there is an interaction, the simple slopes will change after centering - this is usually something people want (to have the simple slopes represent "main effects").

As @strengejacke pointed out, if you want more fine-grained control, you can manually standardize each variable as you see fit, prior to model fitting.

Seeing how the back-end function (datawizard::standardize.data.frame()) is set up, I don't see the suggested functionality being added right now.

@alaindanet
Author

Well @mattansb, I agree with you in the case where the 0 value of your predictor variable gives little insight, such as age in an epidemiological study of an adult population.
But in the case where the 0 value is of interest, such as a variable describing changes in weight, it sounds more relevant not to center the variable at the mean change in weight.

Here again, I quote Gelman (2008):

> We also center each input variable to have a mean of zero so that interactions are more interpretable. Again, in some applications it can make sense for variables to be centered around some particular baseline value, but we believe our automatic procedure is better than the current default of using whatever value happens to be zero on the scale of the data, which all too commonly results in absurdities such as age = 0 years or party identification = 0 on a 1–7 scale. Even with such scaling, the correct interpretation of the model can be untangled from the regression by pulling out the right combination of coefficients (for example, evaluating interactions at different plausible values of age such as 20, 40, and 60); the advantage of our procedure is that the default outputs in the regression table can be compared and understood in a consistent way.

It is fine that this is low priority! At least, people who have questions about centering/scaling may end up here and read Gelman (2008).

Thank you so much!

@mattansb
Member

@alaindanet I am aware of these points, even though they are less commonly applied - yes, ideally people would understand their scales and units of measure and would center (or not) variables around sensible values derived from domain-specific knowledge.
But if this is the case, effectsize::standardize() wouldn't be used anyway 😉

@mattansb mattansb reopened this Apr 17, 2022
@mattansb mattansb transferred this issue from easystats/effectsize May 3, 2022