Tie Estimation Error to Variance #30
I thought about this back when I watched the videos. For parametric estimators, you can talk about your uncertainty in the parameter values (I made a concept-check question about the covariance matrix of \hat{w} for least squares linear regression, and ridge regression). In general, I think L2 methods are the way, but I don't have a reference.
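For reference, the covariance matrices alluded to above are standard results under the model y = Xw + ε with ε ~ N(0, σ²I); the notation below is mine, not taken from the concept-check question itself:

```latex
% Covariance of the least-squares and ridge estimators under
% y = Xw + \varepsilon, \ \varepsilon \sim \mathcal{N}(0, \sigma^2 I)
% (standard results; notation assumed here, not from the course materials).
\[
\hat{w}_{\mathrm{OLS}} = (X^\top X)^{-1} X^\top y,
\qquad
\operatorname{Cov}\bigl(\hat{w}_{\mathrm{OLS}}\bigr) = \sigma^2 (X^\top X)^{-1}.
\]
\[
\hat{w}_{\lambda} = (X^\top X + \lambda I)^{-1} X^\top y,
\qquad
\operatorname{Cov}\bigl(\hat{w}_{\lambda}\bigr)
  = \sigma^2 (X^\top X + \lambda I)^{-1} X^\top X \, (X^\top X + \lambda I)^{-1}.
\]
```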
Hi David, What do you think about the following visualizations for the excess risk decomposition?
Thank you. Best,
Good evening David,
Thank you very much. Best,
Hi Vlad -- the 2d animation looks good. For minibatch SGD, what about just linear regression (no ridge penalty)? A linear model with additive Gaussian noise sounds fine. Let's start with a fixed step size (i.e., a fixed multiplier of the minibatch gradient).
David, Thank you! I think that we need the step size eta_t to converge to zero for the minibatch method to converge. When I run the minibatch code with a fixed step size, it doesn't converge. It does converge when I try eta_t = 1/t -- are you OK with this, or perhaps I am misunderstanding something?
I think it's fine.
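For concreteness, here is a minimal sketch (not the actual course code; the synthetic data, constants, and function names are assumptions made for illustration) of minibatch SGD for plain linear regression with additive Gaussian noise, comparing a fixed step size against a decaying eta_t = eta_0 / t schedule:

```python
# Sketch only: minibatch SGD for linear regression on synthetic data
# y = X w* + Gaussian noise, with either a fixed or a 1/t-decaying step size.
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 5
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true + 0.5 * rng.normal(size=n)

def minibatch_sgd(X, y, eta0=0.1, batch_size=32, n_steps=5_000, decay=False):
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, n_steps + 1):
        idx = rng.integers(0, n, size=batch_size)        # sample a minibatch
        Xb, yb = X[idx], y[idx]
        grad = 2.0 * Xb.T @ (Xb @ w - yb) / batch_size   # gradient of the average square loss
        eta = eta0 / t if decay else eta0                # eta_t = eta0 / t vs fixed eta0
        w -= eta * grad
    return w

w_fixed = minibatch_sgd(X, y, decay=False)
w_decay = minibatch_sgd(X, y, decay=True)
print("fixed step: ||w - w*|| =", np.linalg.norm(w_fixed - w_true))
print("1/t decay:  ||w - w*|| =", np.linalg.norm(w_decay - w_true))
```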
Given a sample, we get an estimator in the hypothesis space. The performance gap between that estimator and the best function in the space is the estimation error. The estimator is a random function, so if we repeat the procedure on a new training set, we end up with a new estimator. We could show a different point for each new batch of data, clustering around the optimal function. If we take a larger training set, the variance of those points should decrease. I don't know of a precise measure of this "variance". But if I draw it this way, I need to point out that this is just a cartoon, in which points in the space correspond to prediction functions, and closer points correspond to prediction functions with more similar predictions (say in L2 norm for score functions, or probability of disagreement for classifiers).
Probably of relevance here is Pedro Domingos's paper on generalizing bias-variance decompositions beyond the square loss: http://homes.cs.washington.edu/~pedrod/bvd.pdf
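One way to make the "variance of the estimator" picture above concrete is a small simulation (a sketch under an assumed synthetic setup, not code from this repo): fit least squares on many independent training sets and measure how far each fitted prediction function lands from the optimal one, in empirical L2 norm on a fixed set of test points. The spread should shrink as the training set grows.

```python
# Sketch: empirical "variance" of the estimated prediction function across
# repeated training sets, measured as squared L2 distance (on a fixed test
# design) to the best-in-class linear function. Synthetic setup assumed here.
import numpy as np

rng = np.random.default_rng(0)
d, noise_std, n_reps = 5, 1.0, 200
w_star = rng.normal(size=d)              # best-in-class linear function (well-specified model)
X_test = rng.normal(size=(2_000, d))     # fixed points at which predictions are compared

def spread(n_train):
    """Average squared L2 distance between fitted and optimal predictions."""
    dists = []
    for _ in range(n_reps):
        X = rng.normal(size=(n_train, d))
        y = X @ w_star + noise_std * rng.normal(size=n_train)
        w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)    # least-squares fit on this training set
        dists.append(np.mean((X_test @ w_hat - X_test @ w_star) ** 2))
    return np.mean(dists)

for n in (50, 200, 800):
    print(f"n = {n:4d}:  E||f_hat - f*||^2 ≈ {spread(n):.4f}")
```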