Sunday, January 11, 2015

Key methods in Machine Learning

Bagging: as in Random Forest, run multiple trees, each on a different bootstrap sample of the data. For example, with 100 data points, we sample 100 points with replacement from the original dataset (so some observations may be chosen twice) to fit each tree. On average 1 - 1/e ~ 63% (roughly two thirds) of the data points get picked; the unpicked ~37% can be used for out-of-bag prediction, which plays a role similar to cross validation.
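A minimal sketch of this in Python, assuming scikit-learn is available (the dataset and settings here are illustrative, not from any real project):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

# 100 data points, as in the example above
X, y = make_classification(n_samples=100, random_state=0)

# Each of the 50 trees (the default base estimator is a decision tree)
# sees a bootstrap sample of the rows; oob_score=True evaluates each
# tree on the rows it never saw during fitting.
bag = BaggingClassifier(n_estimators=50, bootstrap=True, oob_score=True,
                        random_state=0)
bag.fit(X, y)
print("out-of-bag accuracy:", bag.oob_score_)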

Random Forest: in addition to bagging (bootstrapping the rows of the input matrix), random forest also subsamples the columns. At each split, a tree considers only a randomly chosen subset of the columns (inputs). This reduces the correlation among the different trees, thereby increasing the performance of the final ensemble.
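A minimal sketch with scikit-learn's RandomForestClassifier (synthetic data; settings are illustrative), where max_features controls how many columns are considered at each split:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, n_features=20, random_state=0)

# max_features="sqrt": each split tries only sqrt(20) ~ 4 random
# columns, which decorrelates the trees in the ensemble.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            oob_score=True, random_state=0)
rf.fit(X, y)
print("out-of-bag accuracy:", rf.oob_score_)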

Boosting: if in bagging all the simple trees are created equal, in boosting they are not. Boosting is done sequentially: after the first tree is fitted, the second tree is made to focus on the examples the first one misclassified, and each subsequent tree keeps focusing on the mistakes made by the previous trees.
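A minimal sketch of this reweighting idea, using scikit-learn's AdaBoostClassifier on synthetic data (one concrete boosting method; settings are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=200, random_state=0)

# 100 shallow trees are fitted one after another; after each round the
# misclassified examples get larger weights, so the next tree focuses
# on them.
boost = AdaBoostClassifier(n_estimators=100, random_state=0)
boost.fit(X, y)
print("training accuracy:", boost.score(X, y))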

Ensemble: both Random Forest and Gradient Boosting Machine are examples of ensemble methods. In general, one can build ensembles of ensembles of ensembles (as in the winning solution to the Netflix Prize). For classification, an ensemble can be combined by majority vote (random forest does this). For regression, an ensemble can be combined by averaging the individual regressors. More sophisticated combiners (but also more prone to overfitting) are linear regression, ridge, lasso, and Bayesian averaging. I like non-negative least squares.
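As an illustration of that non-negative-least-squares combiner, here is a minimal blending sketch in Python (scipy and scikit-learn assumed; the models and data are illustrative):

import numpy as np
from scipy.optimize import nnls
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, noise=10.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

models = [RandomForestRegressor(random_state=0),
          GradientBoostingRegressor(random_state=0),
          Ridge()]
# Columns of P are each base model's predictions on the held-out set.
P = np.column_stack([m.fit(X_tr, y_tr).predict(X_val) for m in models])

# Solve min ||P w - y||_2 subject to w >= 0: non-negative weights for
# the blend, which guards against wild cancellation between models.
w, _ = nnls(P, y_val)
print("blend weights:", w)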

Margin maximization (as in SVM): according to Yann LeCun, margin maximization is similar to L2 regularization. Indeed, maximizing the margin 1/||w|| amounts to minimizing ||w||^2, which is exactly an L2 penalty on the weights.
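A small sketch of the connection with scikit-learn's LinearSVC (synthetic data; purely illustrative): its objective is hinge loss plus an L2 penalty on w, and that penalty term is the margin-maximization part.

from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, random_state=0)

# penalty="l2" (the default) is the margin term; C trades it off
# against the hinge loss on the training data.
svm = LinearSVC(C=1.0, penalty="l2")
svm.fit(X, y)
print("||w||:", (svm.coef_ ** 2).sum() ** 0.5)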

Feature engineering: including nonlinear transforms (x^2, sqrt(x), log(x)) and interaction terms like x1*x2. Wavelet transforms seem to be useful in vision tasks. An SVM kernel can be thought of as implicit feature engineering.
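A minimal sketch of these transforms with scikit-learn and numpy (the toy array is illustrative):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# degree=2 yields 1, x1, x2, x1^2, x1*x2, x2^2 for each row,
# i.e. the polynomial and interaction terms mentioned above.
poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(X))

# sqrt(x) and log(x) can be added by hand:
print(np.hstack([np.sqrt(X), np.log(X)]))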

Overall, the portfolio of (out-of-the-box) machine learning tools is:
1. Random Forest
2. Gradient Boosting Machine (and other boosting methods, e.g. AdaBoost)
3. SVM
4. Lasso / Ridge / Elastic Net / Linear / Logistic / GLM
5. Gaussian Process 
6. LDA / QDA
7. Naive Bayes
8. Neural Network

6 and 7 are generative models. The rest, including Gaussian Processes (which model p(y|x) directly), are discriminative. 3 and 5 are kernel methods. 1 and 2 are tree-based methods. A quick sketch comparing several of these is below.
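Here is a minimal sketch that runs several of these out-of-the-box tools through 5-fold cross-validation on one synthetic dataset (the model settings are illustrative defaults):

from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

models = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "SVM": SVC(),
    "Logistic": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "Naive Bayes": GaussianNB(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(name, round(scores.mean(), 3))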
