Unclear wording in "Composing Optimizers" section of docs #1627
Short term fix: perhaps the default value for the LR in `ExpDecay` should be `1`.

Long term rant: Thank you for taking the effort to type this up, because what you are highlighting here is something that I've brought up before. Schedules and optimizers are not the same thing. It's a cute trick that certain schedules can be "composed" with our optimizers, but I really strongly feel we should fix this by making schedules their own distinct thing. Schedulers (in general) wrap an optimizer; they do not compose with them.
`1` should be fine for now.
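To make that distinction concrete, here is a minimal sketch. Only `Optimiser`, `ExpDecay`, and `Descent` are Flux names; the `Scheduled` wrapper below is purely hypothetical and not part of Flux.

```julia
using Flux

# "Composing" (what Flux currently offers): the rules are chained, so
# ExpDecay's eta is multiplied into Descent's eta at every step.
composed = Optimiser(ExpDecay(0.001, 0.1, 1000, 1e-4), Descent(0.1))

# "Wrapping" (hypothetical sketch): a scheduler owns an optimiser and
# overwrites its learning rate from a schedule before each step, instead
# of multiplying a second learning rate into it.
mutable struct Scheduled{O}
    opt::O
    schedule    # step::Int -> learning rate
    t::Int
end

function next!(s::Scheduled)
    s.t += 1
    s.opt.eta = s.schedule(s.t)   # e.g. exponentially decay Descent's own eta
    return s.opt
end

wrapped = Scheduled(Descent(0.1), t -> 0.1 * 0.5^(t ÷ 1000), 0)
```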
#1628: Update "Composing Optimisers" docs (r=darsnack, a=StevenWhitaker)

Addresses #1627 (perhaps only partially). Use `1` instead of `0.001` for the first argument of `ExpDecay` in the example, so that the sentence following the example, i.e.,

> Here we apply exponential decay to the `Descent` optimiser.

makes more sense.

It was also [suggested](#1627 (comment)) in the linked issue that it might be worth changing the default learning rate of `ExpDecay` to `1`. Since this PR doesn't address that, I'm not sure merging this PR should necessarily close the issue.

Co-authored-by: StevenWhitaker <steventwhitaker@gmail.com>
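Concretely, the docs change amounts to something like the following (assuming the example leaves `ExpDecay`'s remaining arguments at their defaults):

```julia
using Flux

# Before: initial effective learning rate ≈ 0.001 * 0.1 = 0.0001
opt = Optimiser(ExpDecay(0.001), Descent())

# After #1628: ExpDecay contributes only the decay schedule, so the initial
# effective learning rate is Descent's default of 0.1
opt = Optimiser(ExpDecay(1), Descent())
```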
In the Composing Optimizers section of the docs it states:
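The example and the sentence in question were roughly as follows (the `0.001` first argument of `ExpDecay` is the one #1628 later changed; the remaining arguments are assumed here to be the defaults):

```julia
opt = Optimiser(ExpDecay(0.001), Descent())
```

> Here we apply exponential decay to the `Descent` optimiser.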
I think that last sentence is a bit misleading to someone who is not very familiar with Flux. When I read "apply exponential decay to the `Descent` optimiser", I take that to mean that we use the learning rate defined by `Descent`, but update it according to some schedule defined by `ExpDecay`. But in reality, if I understand correctly, the above example from the docs uses an initial learning rate that is actually `0.0001` (because the default learning rate of `Descent` is `0.1`), not `0.1` like I would expect.

(Somewhat tangential note: One possible reason for the confusion might be the fact that `ExpDecay` by itself can be used to do gradient descent, but that isn't conveyed by the name `ExpDecay`. I would think that something called `ExpDecay` would have to be paired with an optimizer, and would not have its own learning rate.)
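To put numbers on this, here is a small sketch using the internal `apply!` from Flux's `Optimise` module (shown only to illustrate the scaling; the values assume the current defaults of `ExpDecay` and `Descent`):

```julia
using Flux
using Flux.Optimise: apply!, Optimiser, ExpDecay, Descent

w    = zeros(3)   # parameter array (ExpDecay only uses it as a key internally)
grad = ones(3)    # pretend gradient

# ExpDecay on its own already scales the gradient by its own eta (default 0.001),
# so it behaves like a (decaying) gradient-descent rule by itself.
step_alone = apply!(ExpDecay(), w, copy(grad))                           # ≈ 0.001 .* grad

# Composed with Descent, the two learning rates multiply: the first step uses
# roughly 0.001 * 0.1 = 0.0001, not Descent's 0.1.
step_composed = apply!(Optimiser(ExpDecay(), Descent()), w, copy(grad))  # ≈ 0.0001 .* grad
```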
Two possible ways to improve the clarity of the docs:

1. Update the example to pass `1` instead of `0.001` as the first argument of `ExpDecay`, and leave the rest of the wording as is.
2. Use a different example, especially since composing `ExpDecay` with `Descent` is (if I understand correctly) equivalent to using `ExpDecay` on its own with a rescaled learning rate, so the current example doesn't seem as useful as another might be. (A sketch of one alternative follows this list.)
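A sketch of one alternative, given purely as an illustration (not wording from any PR), where the composition adds something neither rule does on its own:

```julia
using Flux.Optimise: Optimiser, ClipValue, Descent

# Clip each gradient entry to [-1e-3, 1e-3], then take a plain descent step.
opt = Optimiser(ClipValue(1e-3), Descent(0.1))
```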
I am happy to open a PR to make the simple change suggested in 1., but I don't know Flux well enough to do 2.