Friday, February 7, 2014

High Dimensional Statistics

A typical problem in statistics is the linear model: we try to predict an output yi from inputs xi1, ..., xip, assuming the model yi = b0 + b1*xi1 + b2*xi2 + ... + bp*xip.
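
As a concrete illustration of the model above, here is a minimal least-squares fit in Python on made-up data (the numbers and variable names are just placeholders for the example):

    import numpy as np

    # Made-up data: n = 50 observations, p = 3 inputs.
    rng = np.random.default_rng(0)
    n, p = 50, 3
    X = rng.normal(size=(n, p))
    y = 3.0 + X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=n)

    # Ordinary least squares: add an intercept column and solve min ||y - Xb||^2.
    X1 = np.column_stack([np.ones(n), X])
    beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
    print(beta_hat)  # roughly [3.0, 1.0, -2.0, 0.5]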

High dimensional statistics just means p is very large. One then needs a tool to find the subset of inputs that best predicts yi; if we include all the inputs in the model, it will overfit. A couple of approaches to fix this:

  1. Subset selection
    • Say we want to reduce the dimension to 10: try every combination of 10 inputs, see which combination gives the largest R-squared, and choose that as our model. The good side: we get the best combination. The bad side: the computation is combinatorially expensive (a brute-force sketch appears after this list).
    • Forward stagewise / backward stagewise: find the x most correlated with y, regress y on that x, and take the residual. Now find, among the remaining x's, the one most correlated with this residual, and keep going like this, say, 10 times. The good: very efficient computationally, and it often gives a good result too. The bad: it is not guaranteed to find the best subset (see the greedy sketch after this list).
  2. Continuous shrinkage: subset selection is discrete, meaning either 10 or 11 variables are included. Continuous shrinkage instead shrinks all the betas toward zero, thus reducing the degrees of freedom.
    • Ridge regression: if the least squares solution minimizes (1/2n)(y-Xb)^t (y-Xb), ridge regression minimizes that sum plus lambda*(b1^2 + b2^2 + ... + bp^2). For simplicity, assuming the inputs are orthonormal, ridge regression shrinks each beta to (1/(1+lambda)) times the OLS beta. The good thing about ridge is that it has an analytical solution in matrix terms (see the closed-form sketch after this list).
    • Lasso: the lasso is similar to ridge, but the penalty is now of first degree: lambda*(|b1| + |b2| + ... + |bp|). Note that ridge regression penalizes the large betas too much. The lasso, on the other hand, again assuming orthonormal inputs, shrinks each beta by the same amount lambda and sets the small ones exactly to zero (the soft-thresholding sketch after this list makes this explicit). The great thing about the lasso is that it is computationally cheap (comparable to OLS) and very well understood in terms of inference, degrees of freedom, etc.
    • Bridge regression: a generalized version of the lasso and ridge, where the power in the penalty can be anything. Note that if the power >= 1 the loss function is convex and thus easy to minimize; for power < 1 it is not. For power > 1 the penalty is differentiable, but it will not set any beta exactly to zero. There is also the Elastic Net, whose penalty is a convex combination of the lasso and ridge penalties; with adaptive weights it has oracle properties.
    • Adaptive Lasso: similar to the lasso, except each beta gets its own penalty weight. So we can in fact run the lasso twice, the first time to find the weights (see the two-step sketch after this list). The good thing is that it can have oracle properties, which is amazing :-).
    • Other methods exist, like SCAD. To choose the parameter lambda, one uses cross-validation; a more theoretical approach would be AIC or BIC (see the last snippet after this list). Computation of the lasso path is done by the LARS algorithm.
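
To make the subset selection idea concrete, here is a brute-force sketch: try every subset of a fixed size k and keep the one with the largest R-squared. The function name and the data it is applied to are just placeholders.

    from itertools import combinations
    import numpy as np

    def best_subset(X, y, k):
        # Try every size-k subset of the columns of X; keep the one with the largest R-squared.
        n = len(y)
        tss = np.sum((y - y.mean()) ** 2)
        best_r2, best_cols = -np.inf, None
        for cols in combinations(range(X.shape[1]), k):
            Xs = np.column_stack([np.ones(n), X[:, list(cols)]])
            beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
            rss = np.sum((y - Xs @ beta) ** 2)
            r2 = 1.0 - rss / tss
            if r2 > best_r2:
                best_r2, best_cols = r2, cols
        return best_cols, best_r2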
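
And a sketch of the greedy forward procedure: repeatedly pick the remaining input most correlated with the current residual, regress the residual on it, and update. This is a simplified illustration of the idea, not a full stagewise implementation.

    import numpy as np

    def forward_select(X, y, k):
        resid = y - y.mean()
        chosen = []
        for _ in range(k):
            remaining = [j for j in range(X.shape[1]) if j not in chosen]
            # Pick the remaining column most correlated (in absolute value) with the residual.
            corrs = [abs(np.corrcoef(X[:, j], resid)[0, 1]) for j in remaining]
            j = remaining[int(np.argmax(corrs))]
            chosen.append(j)
            # Regress the residual on the chosen column and keep the new residual.
            xj = X[:, j] - X[:, j].mean()
            b = (xj @ resid) / (xj @ xj)
            resid = resid - b * xj
        return chosen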
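
The analytical ridge solution in matrix terms, assuming X and y are centered so the intercept can be ignored, and with lambda multiplying the unscaled sum of squared coefficients:

    import numpy as np

    def ridge(X, y, lam):
        # Closed-form ridge estimate: beta = (X'X + lam * I)^(-1) X'y.
        p = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

If the columns of X are orthonormal, X'X is the identity and this reduces to (1/(1+lambda)) times the OLS estimate, the shrinkage factor quoted in the ridge bullet.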
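
The corresponding lasso solution under an orthonormal design is soft thresholding: each OLS coefficient is moved toward zero by lambda and set exactly to zero when it is smaller than lambda in absolute value. A one-line sketch:

    import numpy as np

    def soft_threshold(beta_ols, lam):
        # Lasso coefficient under an orthonormal design: shrink toward zero by lam,
        # and set to zero whenever |beta_ols| <= lam.
        return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)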
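
A sketch of the two-step adaptive lasso using scikit-learn, where a first lasso fit supplies the weights and the weighted problem is then solved as a plain lasso on rescaled columns (using LassoCV for both steps and gamma = 1 are assumptions made for the example):

    import numpy as np
    from sklearn.linear_model import LassoCV

    def adaptive_lasso(X, y, gamma=1.0):
        # Step 1: an initial lasso fit gives weights 1 / |beta_j|^gamma.
        init = LassoCV(cv=5).fit(X, y).coef_
        scale = np.abs(init) ** gamma
        scale[scale == 0] = 1e-8      # variables dropped in step 1 stay heavily penalized
        # Step 2: a weighted lasso is equivalent to a plain lasso on rescaled columns.
        fit = LassoCV(cv=5).fit(X * scale, y)
        return fit.coef_ * scale      # map back to the original parameterization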
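
Finally, a sketch of choosing lambda by cross-validation (scikit-learn calls it alpha) and of fitting the lasso path with LARS; the data are random placeholders just so the snippet runs:

    import numpy as np
    from sklearn.linear_model import LassoCV, LassoLarsCV

    # Placeholder data so the snippet is self-contained.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 20))
    y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

    cv_fit = LassoCV(cv=10).fit(X, y)          # coordinate-descent lasso, lambda picked by CV
    lars_fit = LassoLarsCV(cv=10).fit(X, y)    # lasso path computed with the LARS algorithm
    print(cv_fit.alpha_, lars_fit.alpha_)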
