
Tie Estimation Error to Variance #30

Open
davidrosenberg opened this issue Jan 15, 2017 · 6 comments

@davidrosenberg
Owner

Given a sample, we get an estimator in the hypothesis space. The performance gap between the estimator and the best function in the space is the estimation error. The estimator is a random function, so if we repeat the procedure on a new training set, we end up with a new estimator. We could show a different point for each new batch of data, clustering around the optimum. If we take a larger training set, the variance of those points should decrease. I don't know of a precise measure of this "variance". But if I draw it this way, I need to point out that this is just a cartoon, in which points in the space correspond to prediction functions, and closer points correspond to prediction functions with more similar predictions (say in L2 norm for score functions, or probability of disagreement for classifiers).
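One minimal sketch of a proxy for this "variance" (everything here is my own illustration, not course code; the regression-tree model, the sine generating process, and the grid-based L2 approximation are all placeholder choices): fit the same estimator on many independent training sets, evaluate each fitted prediction function on a common grid, and report the average squared L2 distance from the mean prediction function.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def sample_training_set(n):
    # Hypothetical data-generating process, for illustration only.
    x = rng.uniform(-1, 1, size=(n, 1))
    y = np.sin(3 * x[:, 0]) + rng.normal(scale=0.3, size=n)
    return x, y

def prediction_spread(n, n_replications=200):
    """Average squared L2 distance (approximated on a grid) between each
    fitted prediction function and the mean prediction function."""
    grid = np.linspace(-1, 1, 200).reshape(-1, 1)
    preds = []
    for _ in range(n_replications):
        x, y = sample_training_set(n)
        f_hat = DecisionTreeRegressor(max_depth=3).fit(x, y)
        preds.append(f_hat.predict(grid))
    preds = np.stack(preds)               # shape: (replications, grid points)
    mean_pred = preds.mean(axis=0)
    return np.mean(((preds - mean_pred) ** 2).mean(axis=1))

for n in (50, 200, 800):
    print(n, prediction_spread(n))        # spread should shrink as n grows
```

This is only one possible stand-in for the cartoon's "variance": the grid-based distance approximates the L2 distance between prediction functions mentioned above.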

Probably of relevance here is Pedro Domingos's paper on generalizing bias-variance decompositions beyond the square loss: http://homes.cs.washington.edu/~pedrod/bvd.pdf

@brett1479
Collaborator

I thought about this back when I watched the videos. For parametric estimators, you can talk about your uncertainty in the parameter values (I made a concept-check question about the covariance matrix of \hat{w} for least squares linear regression and for ridge regression). In general, I think L2 methods are the way to go, but I don't have a reference.
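For reference (these are the standard results under the fixed-design model y = Xw + \epsilon with \epsilon \sim N(0, \sigma^2 I); they are not quoted from the thread or the concept-check question):

```latex
\mathrm{Cov}(\hat{w}_{\mathrm{OLS}}) = \sigma^2 (X^\top X)^{-1},
\qquad
\mathrm{Cov}(\hat{w}_{\mathrm{ridge}}) = \sigma^2 (X^\top X + \lambda I)^{-1} X^\top X \,(X^\top X + \lambda I)^{-1}.
```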

@vakobzar
Contributor

Hi David,

What do you think about the following visualizations for the excess risk decomposition?

  1. Decision tree -- expand on the classification problem from the slides:
    (a) 2D plots similar to pp. 30-31 of your Excess Risk Decomposition slides, for a few different sample sizes: we can plot the depth of the tree on the x-axis and the error on the y-axis, decomposed into estimation, approximation and optimization errors as colored bars.
    (b) Also 3D plots showing the depth on the x-axis, the sample size on the y-axis and the error on the z-axis, decomposed into 3 surfaces representing estimation, approximation and optimization errors.

  2. Linear model -- fit y(x) = a + b x_1 + c x_2 where x = (x_1, x_2), sampling data from the distribution y = w_0 + w_1 x + \epsilon, where \epsilon \sim N(0, 2^2). We plot the clustering of w = (a, b, c), representing the estimation error, for different sample sizes (see the sketch after this comment).

  3. Ridge regression -- we can plot the error vs. complexity. Do you have a particular distribution to sample from in mind?

Thank you.

Best,
Vlad
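A rough sketch of what item 2 above could look like (interpreting the generating model as depending only on x_1, i.e. y = w_0 + w_1 x_1 + \epsilon; the coefficient values, sample sizes and replication count are placeholders of mine, not the actual course code): draw many training sets, fit the three-coefficient linear model by least squares, and collect the fitted coefficient vectors for scatter-plotting.

```python
import numpy as np

rng = np.random.default_rng(1)
w0_true, w1_true, sigma = 1.0, 2.0, 2.0   # placeholder values; noise sd matches N(0, 2^2)

def fitted_coefficients(n, n_replications=500):
    """Least-squares fits of y ~ a + b*x1 + c*x2 on data generated from
    y = w0 + w1*x1 + eps; returns an array of (a, b, c) across replications."""
    coefs = np.empty((n_replications, 3))
    for i in range(n_replications):
        x1 = rng.uniform(-1, 1, size=n)
        x2 = rng.uniform(-1, 1, size=n)
        y = w0_true + w1_true * x1 + rng.normal(scale=sigma, size=n)
        X = np.column_stack([np.ones(n), x1, x2])
        coefs[i], *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs

for n in (20, 100, 500):
    print(n, fitted_coefficients(n).std(axis=0))   # per-coefficient spread shrinks roughly like 1/sqrt(n)
```

The arrays returned by fitted_coefficients could feed directly into the (a, b, c) scatter plots described in item 2.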

@vakobzar
Contributor

Good evening David,

  1. I posted a 2D animation for GD with a fixed step size at
    https://github.com/davidrosenberg/mlcourse-homework/blob/master/in-prep/recitations/gd_fixed_step_2d.ipynb
    Please let me know if this is what you had in mind. Tomorrow (Friday) I will overlay the other gradient descent methods we discussed.

  2. For the demo of the distribution of minibatch SGD directions, are you OK if we use a ridge regression model and sample from a linear model with additive Gaussian noise? Also did you have any particular step size in mind, e.g., 1/n?

Thank you very much.

Best,
Vlad

@davidrosenberg
Owner Author

Hi Vlad -- the 2D animation looks good. For minibatch SGD, what about just linear regression (no ridge penalty)? A linear model with additive Gaussian noise sounds fine. Let's start with a fixed step size (i.e., a fixed multiplier of the minibatch gradient).
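(For concreteness, the "fixed multiplier of the minibatch gradient" refers to the usual update below; the notation is mine, not from the thread, with B_t the minibatch at step t and \ell_i the per-example loss:)

```latex
w_{t+1} = w_t - \eta \,\frac{1}{|B_t|} \sum_{i \in B_t} \nabla_w \ell_i(w_t)
```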

@vakobzar
Contributor

David, thank you! I think we need the step size eta_t to converge to zero for the minibatch method to converge. When I run the minibatch code with a fixed step size, it doesn't converge. It does converge when I try 1/t -- are you OK with this, or am I perhaps misunderstanding something?
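A small sketch of the behaviour being described (the model, dimensions, step sizes and batch size below are illustrative choices of mine, not the notebook code): with a fixed step size the minibatch iterates settle into a noise ball around the empirical risk minimizer, while eta_t = eta_0 / t typically drives them much closer to it.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 1000, 2
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0])
y = X @ w_true + rng.normal(scale=2.0, size=n)
w_star, *_ = np.linalg.lstsq(X, y, rcond=None)   # empirical risk minimizer

def minibatch_sgd(step_rule, batch_size=10, n_steps=5000, eta0=0.2):
    """Minibatch SGD on squared loss; returns final distance to w_star."""
    w = np.zeros(d)
    for t in range(1, n_steps + 1):
        idx = rng.integers(0, n, size=batch_size)
        grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / batch_size
        eta = eta0 if step_rule == "fixed" else eta0 / t
        w = w - eta * grad
    return np.linalg.norm(w - w_star)

print("fixed step:", minibatch_sgd("fixed"))   # stalls in a noise ball around w_star
print("1/t step:  ", minibatch_sgd("decay"))   # typically ends noticeably closer to w_star
```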

@davidrosenberg
Copy link
Owner Author

davidrosenberg commented Jan 23, 2017 via email

@davidrosenberg davidrosenberg removed this from the Excess risk decomposition milestone Jan 28, 2017