Me Learning

Thursday, April 28, 2016

Setup Python github project

1. pytest: let say we have a file linear_regression.py under current directory, we will need to create a file under tests/test_abc.py, that imports linear_regression.py, and do testing. We then can run
> py.test --cov=packagename --pyargs packagename
to perform unit testing. It will report coverage percentage, which roughly means how many lines of codes are covered with unit testing.
2. doctest: under the definition of a function, we can include document testing, e.g.

def f(x):
"""
Examples:
>>> f(2)
4
"""
return x**2

Then we can run doctest that check for correct result
3. style test: to make sure that our codes are in good styles, we can do
> py.test --pep8 --flakes --pyargs hyperemble
4. Travis: to make sure that our codes work across platform, across different version of Python, we can use travis. Upload the code to github, and travis will perform test across different platform for correctness

1. Create a github repository with README.md and license,
2. Create a setup.py
3. Create a Makefile that contains make style and make test
4. Add Travis CI support
5. Add requirements.txt for the require packages

Wednesday, April 13, 2016

Python Vim

The hardest part is to get Python work with Vim for autocompletion

Install Xcode, Command Line Tools, and home-brew, see home-brew
Reinstall system python and vim here
Install Vundle, and autocomplete by here

Git Merge at last

So finally, I learn git merge.

Let say we do:

git init
git add abc.py
git commit -m "Initial Commit"
Now here is the image

Now master is the name of the default branch. HEAD is always where you are at the moment. Origin is the default remote name (in case you have a github backup).
Now if we add some file or make change to abc.py and commit,
git add abc.py
git commit -m "Commit 2"
Then we will have the image

Both master and HEAD move to the new commit. Now let's say we have a bug in C2, and want to create a branch to fix this bug. We can do

git checkout -b bug1
git add abc.py # with attempted bug fixed
git commit -m "Trying to fix bug"

This will be the image

So master is at old place, HEAD is now pointing to new label "bug1", which points at new commit. Meanwhile, we can go back at master and make some update

git checkout master
git add abc.py
git commit -m "Add new feature. While the bug is still there"

Here is the mental image

Finally, we want to merge C4 and C3. We can do

git merge bug1

This will gives an error if there is conflict. It will automatically add to the abc.py file lines that look like
>>>>>>>>>
this line appear in C4
=========
but this line appears in C4
<<<<<<<<<

We will open that abc.py file, resolve the conflict by remove those code blocks. Then we can
git add abc.py
git commit -m "Successful merge"

Now the two branches will be merged. And the image is

A couple notes: when you do git checkout, you can checkout a commit name, e.g. C1, C2. This will result in a complain from git called headless. HEAD should often be pointed to a label, e.g. master or bug1.
Another note, if we try to merge two nodes, and on of them is the parent of the other, then it will just fastforward to the more current node without any conflict.
Final note, if you want to create a branch from old node/commit, you can do

git branch branchname <sha1-of-commit>

Monday, February 29, 2016

Auto-completion for Python + Vim

By default, in Vim, one can do Ctrl + P, Ctrl + N, that will do autocompletion based on existing keywords that we have typed in the document.
One can use pydiction, which have a predefined dictionary of autocomplete words. For this method, we can autocomplete sklearn modules, but not variable name like the first case.
Also, we would like to autocomplete file path as well.

It seems YouCompleteMe is a good package for this, even though setting it up is very time consuming. I still don't know what make it work. Suggestion is to follow this link and reinstall vim and python
http://www.oschrenk.com/vim-youcompleteme-and-python/

brew install vim
brew install python

Also, add a line in ~/.vimrc
Bundle 'Valloric/YouCompleteMe'
then open vim and run
":BundleInstall"

Then follow the guide on YouCompleteMe github.
These links are helpful
https://github.com/Valloric/YouCompleteMe#user-guide
http://susu.github.io/posts/clang-completion-in-vim-with-youcompleteme/
http://www.alexeyshmalko.com/2014/youcompleteme-ultimate-autocomplete-plugin-for-vim/
https://news.ycombinator.com/item?id=5169062

PS: This could be the ultimate guide
https://blog.dbrgn.ch/2013/5/27/using-jedi-with-ymc/

Monday, March 9, 2015

Machine Learning on Classic Dataset

MNIST

Logistic Regression: 7.63% Error - Very fast - pylearn2
Random Forest: 3% Error - Very fast (sklearn)
MLP - 2% Error

CIFAR10

Logistic Regression: 17% Accuracy
PCA + Logistic Regression: 39% Accuracy
Adaboost: 40% Accuracy
Random Forest: 49% Accuracy
GBM: 50% Accuracy (very slow)

Saturday, February 21, 2015

Learned HTML by accident

When learning how to parse text data for a machine learning problem, I accidentally learn HTML. Not that hard. At first I thought it was XML. A paragraph is wrapped in <p>...</p>. A block of code for example is done by <pre><code>...</code></pre>. Then you can include Latex if supported, html link, image. All by the similar format as above.

Regular Expression

Regular Expression is a pain in the something. I have to learn it today. I better write out before forgetting everything. Here is the link I follow Regex Python.

The main command is:\n

re.sub('abc', '', 'abc111abc')

which act similar to .replace() in Python. A couple special characters:

"." means any character except newline character. To get new line we include the flag re.S at the end of re.sub
"*" means repeat the previous character zero time, one time, or any time
"^" somehow means not some character. It is also used to denote beginning of line

Let see some example:


re.sub('abc.*(def).*xyz', '', '222abc1def11xyz333')
# 222333
re.sub('abc.*(def).*xyz', '', '222abc1de1f11xyz333')
# '222abc1de1f11xyz333'

means search for a pattern starts with "abc", next with any character or no character, does contain the phrase "def" somewhere in the middle, continue with any character and ends with "xyz". Now this kind of search is greedy, it try to find the longest patter possible.


re.sub('abc((?!def).)*xyz', '', '222abc1def11xyz333')
# '222abc1def11xyz333'

means almost the same, except that the phrase "def" must not be somewhere in the middle. Now if we want to search thing closest possible - being lazy, instead of greedy, we put a question mark:


re.sub('abc((?!dee).)*?xyz', '', '000abc111def222xyz333xyz')
# 000333xyz

means search for the shortest pattern that starts with 'abc', end with 'xyz' and does not contain 'dee'. Now there is the symbol ^ to denote the beginning of line, $ to denote the end, re.sub('^abc.*(def).*xyz$', '', '222abc1def11xyz333') will not match anymore. In conclusion, we have


def StripCBlock(s, w1 = 'pre', w2 = 'pre'):
    return re.sub('^' + w1 + '((?!' + w1 + ').)*' + w2 + '$', '', s,  
                  flags = re.MULTILINE | re.S)
def StripInline(s, w1 = 'code', w2 = 'code'):
    return re.sub(w1 + '((?!' + w1 + ').)*?' + w2, '', s,  
                  flags = re.MULTILINE | re.S)
re.sub('[_/\-,\.]',' ', s)
re.sub('[\t\n\r\f\v]',' ', s)
re.sub('[^a-zA-Z ]',' ',s).lower()
re.sub('\s+', ' ', s).strip()

The first function will search for pattern that starts with w1 (right before w1 is a newline, which is why "^" was in front), then go as far as possible and ends with w2, but must not contain w1 in the middle. The $ means w2 should be right in front of a newline character. The first flag is for searching to work with newline in the middle. The second flag is for "." to include newline. \n The second function will search for any pattern that starts with w1, ends with w2 as soon as possible, and must not contain w1 in the middle. The next line replace any of the special character to space. The line after that replace any of the white space to a space. The line after keeps only character and space, and convert uppercase to lowercase. The final line convert multiple space into one single space.

Sunday, January 11, 2015

Deep Learning

Deep Learning seems to be something that is hard to understand, comparing with other machine learning methods.
Here are the basic ideas that I've got:

1. Unsupervised Pre-training (using Restricted Boltzmann Machine (RBM))
a. At the bottom layer is your input.
b. Generate the first hidden layer using a RBM. To understand what is RBM, follow Intro to RBM
c. Using the newly generated hidden layer, generate the next hidden layer in the same manner as b. Keep doing as many layers as you want.
d. Now the top hidden layer is connected with the output via logistic regression for example.
2. After all the weight are initialized, use standard back propagation to fine tune the weights. Back propagation is just the usual gradient descent, having a new name emphasizing the multiple layers (chain rule of derivative).

Instead of RBM, one can use auto-encoder, denoising auto-encoder, sparse auto-encoder. The idea of these supervised methods are similar to PCA, in that they try to find a representation of the inputs, that is hopefully more useful in predicting the output. Unlike PCA which is linear, then a new layer of PCA is still linear and is not helpful, these methods are not linear. For example in vision task, if the input (lowest layer) is the pixel intensity. The next layer might learn to find the edges (similar to Gabor wavelet), then the next layer might represent figures, then the next layer learn objects and so on.

These methods can be thought of as automatic feature engineering.

The tutorial codes at Deep Learning Tutorial is helpful.

Key methods in Machine Learning

Bagging: as in Random Forest, run multiple trees, each with a different bootstrap sample of the data. So for example, we have 100 data points, we sample with replacement 100 point from the original dataset (so some observation might be chosen twice) to fit a tree. On average 1/e ~ two third of the data points will be picked. The unpicked can be used to do out of bag prediction, which is similar to cross validation.

Random Forest: in addition to bagging (boot strap vertically the input matrix), random forest also perform bootstrap horizontally the input matrix. Each tree is run with some randomly chosen columns (inputs). This reduce the correlation among different tree, by that increasing the performance of the final ensemble.

Boosting: If in bagging, all the simple trees are created equal, in boosting, they are not. Boosting is done sequentially, after the first tree is fitted, the second tree is made to focus on the wrongly classified examples done by the first. And then the next tree keep focusing on the mistakes made by previous trees.

Ensemble: Both Random Forest and Gradient Boosting Machine are example of ensemble methods. In general, one can have ensemble of ensemble of ensemble (as the winning solution to the Netflix prize). For classification, an ensemble can be done as the majority vote (random forest does this). For regression, an ensemble can be done by averaging individual regressor. More sophisticated methods (but also prone to overfitting) are linear regression, ridge, lasso, Bayesian averaging. I like non negative least square.

Margin maximization (as in SVM): according to Yann LeCun, margin maximization is similar to L2 regularization.

Feature engineering: including polynomial terms (x^2, sqrt(x), log(x)), interactive term x1 x2. Wavelet transform seem to be useful in vision tasks. SVM kernel can be thought of as feature engineering.

Overall, the portfolio of (out of the box) machine learning tools are:

1. Random Forest

2. Gradient Boosting Machine (and other boosting methods i.e. AdaBoost)

3. SVM

4. Lasso / Ridge / Elastic Net / Linear / Logistic / GLM

5. Gaussian Process

6. LDA / QDA

7. NaiveBayes

8. Neural Network

5, 6, 7 are generative models. The rest are discriminative models (right?). 3 and 5 are kernel methods. 1, 2 are tree based methods.

Machine Learning - Prediction

A summary of what have worked well for me in one classification task and one regression task.
In short, Random Forest is the winner in both tasks.

1. First task
Description: cloud detection. Satellite pictures of a particular region in the Artic from 5 different angle are provided. One is then asked to classify each pixel whether there is cloud there or not. This is a classification problem, with expert labels. Three images each 340 by 340 pixel are provided. For each pixel, there are 5 raw inputs, each is the pixel intensity from a particular angle. There are also 3 provided engineered features.
X dimension: 100000 by 8
Y dimension: 100000 by 1

Example of an image from one angle:

And the expert label: where blue means cloud, red means no cloud, and green means not sure.

b. Result: Random Forest works the best. The box plot the Area under Curve of each classification algorithm, from 200 different cross validation.

In term of running time:

2. Second Task:

Description: The input is 1750 images, each of dimension 128 by 128 pixel. There is a human subject in the experiment. He is shown these images, and the fMRI signal for 20 different regions of his brain is recorded. In short, fMRI measures the activity (blood flow) of the brain. The task is to use the input, to predict the fMRI response.

X dimension: 1750 by 16384

Y dimension: 20 by 16384

In our problem, the input X is transformed using Gabor wavelet.

An example of images shown to subject

Result: Random Forest and Gradient Boosting Machine are the best performing methods

Where Full means using full set of inputs. Part means we pick the best 5% of inputs, using Lasso (similar to using correlation ranking).

For ease of comparison, here is the result table:

Conclusion: Random Forest seems to work very well. Gradient Boosting Machine is also good. SVM and Neural Network did not work well for me.

Wednesday, December 3, 2014

R System Environment Variables (SEV)

1. Fix bug of not recognizing pdflatex in Rstudio when using knitr

In R, SEV can be obtained by:

> Sys.getenv()

> Sys.getenv(PATH)

> Sys.setenv(path.to.dropbox="~/Dropbox/")

> Sys.unsetenv(path.to.dropbox)

The first will return a list of all SEV. Some of them are the same, for example in my case there are two SEV with the same name PATH. This cause a bug in knitr that makes Rstudio not recognizing pdflatex. Removing the wrong one fixes the bug.

2. Automate the process of setting working directory in R. For GUI R, one can set system environment as above. For command line R. One can set SEV in bash_profile. E.g.

> vi ~/.bash_profile

> export path.to.dropbox="~/Dropbox/"

Now with this set on both GUI R and command line R. We can set working environment on both by a single command:

setwd(Sys.getenv("path.to.dropbox"))

3. Modifying a .sh file that run R file. An .sh file to help run R file might look like this:

> #!/bin/bash

> export NJOBS_HIGH=200

> R CMD BATCH --no-save RunGlmnet.R RunGlmnet.out

Then inside the RunGlmnet.R file, we can get the variable set outside NJOBS_HIGH as:

> Sys.getenv('NJOBS_HIGH')

Doing this will not save NJOBS_HIGH as a global variable (exists all the time), but only create this variable when running the above .sh script.

Turns out there is some complication. After restarting R, things are back to before. To set things permanently, we have to change ${R_HOME}/etc/Renviron file. Where ${R_HOME} can be obtained from R:
> Sys.getenv("R_HOME")

Friday, November 7, 2014

List of Classification Algorithm

Model
1. Linear Regression
2. Logistic Regression
3. SVM
4. QDA (LDA ~ Linear Regression)
5. Random Forest
6. Neural Network
7. Naive Bayes

Feature Engineering
1. Include polynomial term: X^2
2. Interactive Term: X1*X2
3. Sqrt(abs(X))*sign(X)
4. Log(abs(X)+1)*sign(X)

Regularization
1. L1; L1 then OLS on chosen x's
2. L2
3. L1 + L2
4. Adaptive L1
5. Forward Stepwise / Backward Stepwise
6. Preselect with OLS then L1

Model Selection
1. CV,
2. AIC, AICc
3. BIC,
4. Cp

Performance Measure
1. AUC
2. Accuracy
3. F1 Score, Mean Average Precision, Cohen's Kappa
4. LogLoss / Deviance
5. Mutual Information, KL convergence
6. Cluster Correlation as defined in the paper
7. Own method of double centering the joint count table (for binary vs. binary)

Algorithm
1. Newton
1.1. Quasi-Newton
1.2. BFGS
1.3. LBFGS
2. Gradient Descent
2.1. Stochastic Gradient Descent
2.2. Coordinate Descent
2.3. Conjugate Descent
3. Least Angle Regression

Thursday, November 6, 2014

Correlation for Classification

In regression, correlation is one measure to rule them all. It is equivalent to R-squared (corr^2), equivalent to MSE (by equivalent I mean 1-1 correspondence).
One can pick an input x1 over input x2 by comparing the the two correlation with output y. Of course unless we talk about L1 correlation, but it is not too much needed.

Things are messy in classification. There is no singly measure of fitness between two variable.
1. For discrete vs. discrete, one can choose a few measure e.g. accuracy, which is defined as sum of diagonal term of the joint probability matrix. One can use a measure used in comparing two clusters. Will have to talk more about this later
2. For discrete vs. continuous, area under the curve of ROC is very good. But please don't use the Riemann integration approach to calculate the area, as it would be O(n^2). One can use:

auc <- function(truth, preds)
{
  r = truth[order(preds)]
  n.truth = sum(r); n = length(r)
  sum(n.truth - cumsum(r)[!as.logical(r)])/n.truth/(length(r)-n.truth)
}

Or use the function in glmnet: auc.

Other possible metric: mutual information / KL convergence / log likelihood based. Need to read more.

Thursday, October 23, 2014

Matrix, diagonalizable, eigenvalues

I never quite understand what matrix is diagonalizable, what are the relation of eigenvalues with special matrix, until now. The main goal here is to understand real square matrix, but occasionally we will need to refer to complex matrix as well.

Real Invertible Matrix iff eigenvalues are in C\{0}. Nonzero eigenvalues
Hermitian matrix, and special case symmetric real matrix implies real eigenvalues. All real eigenvalues does not imply symmetric. Counter example: [[0,1],[4,0]] has eigenvalues 2 and -2, and is not symmetric. More later on what can be implied from all real eigenvalues.
A "positive definite" matrix, i.e. x'Ax>0 for all x not 0, has all eigenvalues with positive real part. A symmetric/Hermitian and positive definite matrix implies eigenvalues are real and positive, since symmetric/Hermitian implies the real, and positive definite implies the positive real part. Positive eigenvalues and symmetric/Hermitian imply positive definite. Positive eigenvalues alone does not imply positive definite. E.g. [[1,-3],[0,1]] has double eigenvalues 1, but for v = [1,1], the quadratic form is negative.
A skew-symmetric real matrix have eigenvalues with real part equal to zero.
In the complex setting, a matrix is unitary diagonalizable iff it is a normal matrix AA*=A*A. It is a two way relation. Spectral Theorem.
In the real setting, a matrix is orthogonal diagonalizable iff it is symmetric. Again it is a two way relation. One way is implied from Spectral. The other is easy by definition.
In the real setting, a matrix is diagonalizable (i.e. A = PDP^{-1} for P invertible, no need for orthonormal as above) iff it has all real eigenvalues, and its minimal polynomial has no repeated root. Having all real eigenvalues alone is not enough, for example: [[1,1,0],[0,1,0],0,0,1]. This is in Jordan Canonical Form, so can not be diagonalized further. In fact the eigenvalues is all one, and its has only two eigenvectors (1,0,0) and (0,0,1). The eigenspace have dimension 2.
In the complex setting, a matrix is diagonalizable again iff its minimal polynomial has no repeated root (complex already contains all root so no need for first condition).

Finally, for any general n*p matrix A, A = UDV, for U n*n, V p*p are orthogonal matrix, D n*p matrix with r positive elements down the diagonal, every other element zero. r is the rank of A, A', A'A, AA'.

Wednesday, October 22, 2014

List of Useful R package, Command

library(colorout) # Color R output

library(rARPACK) # Calculate a few eigenvalues very fast

library(fields)  # Awesome heatmap plot with legend

!exists("x")  # Check if variable x exists

 

R, Vim, and Tab Completion

Following on the R, Vim, and MacOS blog, I now want to have autocompletion in Vim.
There is a Vim plugin that does this called SuperTab. However it did not work for me at the beginning. Turns out this is because of one setting line in my .vimrc, which was copied from here in order to have Vim works with R. The evil line is:

set paste  " turns on traditional pasting of text

After removing that, tab completion works for keyword. This means if I have written a variable name aefearw before, when I type aef and press Tab, Vim will suggest me that word.

This is not enough, I also want language-specific autocompletion (e.g. R function names). There is a setting in SuperTab that can chain autocompletion, detailed here (at the end of documentation). I copy the two blocks of code from there to my vimrc. It seems to work now except that it still doesn't recognize path. If I remove the second path of the codes below then it does recognize path but not R function names. Will investigate more later. Meanwhile this is good enough.

autocmd FileType *
    \ if &omnifunc != '' |
    \   call SuperTabChain(&omnifunc, "<c-p>") |
    \   call SuperTabSetDefaultCompletionType("<c-x><c-u>") |
    \ endif

let g:SuperTabDefaultCompletionType = 'context'
  autocmd FileType *
      \ if &omnifunc != '' |
      \   call SuperTabChain(&omnifunc, "<c-p>") |
      \ endif

Tuesday, October 21, 2014

Linear Algebra on R part II

It turns out rARPACK is the package of choice when dealing with calculating a few eigenvalues and eigenvectors. To summarize this mess of matrix linear algebra library, we have:

1. BLAS: the basic linear algebra e.g. matrix multiplication
2. LAPACK: the second level linear algebra e.g. SVD, full eigendecomposition
3. ARPACK: sparse linear algebra, which means calculating a few eigenvalues / eigenvectors

openBLAS, intel MKL are the implementation of these library. In C++, Armadillo is an interface for BLAS, LAPACK, and ARPACK. In R, there is RCPP and RcppArmadillo is an interface to Armadillo. Unfortunately the eigs_sym function does not work, as R is not linked with ARPACK as explained here.

At the end, I found rARPACK, which is an interface for ARPACK. It uses Lanczos algorithm. The irlba package written in R also uses this algorithm to do the same thing, but at a slower speed. If we were to just use the full eigendecomposition function it would be very slow on large matrix.

A few benchmark on Macbook Air 2014 13inch Rstudio:

library("corpcor")
library("rARPACK")
#library(rbenchmark)
library(microbenchmark)
library(irlba)

# generate a "fat" data matrix
n = 5000
X = matrix(rnorm(n*n), n, n)
X = t(X)%*%X

# compute SVD
system.time( (s1 = eigen(X,symmetric = TRUE)) ) 
#   user  system elapsed 
# 58.487   1.042  35.895 
system.time( (s2 = svd(X, nu = 5, nv = 0)))
# user  system elapsed 
# 304.342   5.497 138.519
system.time( (s3 = fast.svd(X)) )
# user  system elapsed 
# 396.007   8.395 187.781 
system.time( (s4 = eigs_sym(X, 5)))
# user  system elapsed 
# 1.909   0.029   2.013 
system.time( (s5 = svds(X,k=5)) )
# user  system elapsed 
# 7.686   0.082   4.113 
system.time( (s6 = irlba(X, nu = 5, nv = 0)))
# user  system elapsed 
# 27.559   0.684  12.167

Note that in a Mac, to use multithreaded linear algebra, you might need to replace R's default BLAS library with Mac BLAS implementation name Accelerate following thisguide.

Linear Algebra on R from C++

So one of my end goal is given a data matrix size n*k, where n = 50,000, k = 400. I want to calculate the distance matrix, which should have the size of 50k*50k by Manhattan distance. I then want to use the first three eigenvectors to do clustering, i.e. K-means.

In R, I use dist function. This is very fast but using only one core. And I have no idea how to make it use multiple cores. In R, there is a library irlba which calculate the first few eigenvectors pretty fast.

I want to use C++ to do this job. Now for both R and C++, the backend Linear Algebra is done by BLAS (the basic linear algebra such as addition multiplication), and LAPACK (the higher level linear algebra such as SVD decomposition, QR, Cholesky etc.). Performance of the the basic linear algebra shouldn't be different between R and C++ since R's functions are written in C++.

In C++ there are two competing Linear Algebra packages namely Armadillo and Eigen. Eigen does not use BLAS and LAPACK, and consensus is Eigen slightly faster than Armadillo. Armadillo uses BLAS and LAPACK, basically a wrapper for these basic libraries, as the basic libraries have very ugly syntax.

1. How to use Armadillo in R. First install RcppArmadillo in R. Then in C++ write this:

#include <RcppArmadillo.h>

// [[Rcpp::depends(RcppArmadillo)]]

// [[Rcpp::export]]
arma::vec getEigenValues(arma::mat M) {
    return arma::eig_sym(M);
}

Note that both of the comment are needed for the function to work. The second load the function into R environment. The first declares dependency of RcppArmadillo on Rcpp.
On R write this:

library('Rcpp')
library('microbenchmark')
working.directory <- PATH # Fill in your path here
sourceCpp(file.path(working.directory, 'EigenARMA.cpp'))
n = 2e3
M = matrix(rnorm(n^2),nrow = n, ncol = n)
M = t(M)%*%M
print(system.time(eigen(M)))
print(system.time(getEigenValuesARMA(M)))

2. How to use Eigen in R. Again install RcppEigen in R. Then in C++ write this:

#include <RcppEigen.h>

// [[Rcpp::depends(RcppEigen)]]

using Eigen::Map;                // 'maps' rather than copies 
using Eigen::MatrixXd;                  // variable size matrix, double precision
using Eigen::VectorXd;                  // variable size vector, double precision
using Eigen::SelfAdjointEigenSolver;    // one of the eigenvalue solvers

// [[Rcpp::export]]
VectorXd getEigenValuesEG(Map<MatrixXd> M) {
    SelfAdjointEigenSolver<MatrixXd> es(M);
    return es.eigenvalues();
}

In R write this:

library('Rcpp')
library('microbenchmark')
working.directory <- PATH # Fill in your path here
sourceCpp(file.path(working.directory, 'EigenEG.cpp'))
n = 2e3
M = matrix(rnorm(n^2),nrow = n, ncol = n)
M = t(M)%*%M
print(system.time(eigen(M)))
print(system.time(getEigenValuesEG(M)))

Reference:
RcppEigen Example
RcppArmadillo Example

3. Now comparing the performance: calculating eigenvalues of matrix 2000*2000.

Arma: 3.459s

Eigen: 46.625s

R's built in eigen: 12.547s

But the difference here is because Eigen.cpp only uses one core, while both ARMA and R's built in function use multicores. To set the number of core Blas uses in R, insert this at the beginning:

library(RhpcBLASctl)
blas_set_num_threads(1)
# For 1 means using one core.

Time now is:

Arma: 4.319s

Eigen: 47.635s

R's built in eigen: 8.434s

Everything uses one core now, but there might be still a difference as the Arma seems to work for symmetric matrix. Also it returns only the eigenvalues, not eigenvectors.

4. Other discussion:
It seems rARPACK provides an interface to ARPACK in R directly too. Will check this out.

Tuesday, September 30, 2014

R + VIM + Mac OS (10.10)

So I've decided to move my R work from R studio to the terminal. Apparently the popular choice is Emacs + R and some plugin. However, I am not familiar with Emacs, and decided to find a way to make R work with Vim. Luckily things work out. Here are the steps:

1. We need Vim 7.4. Mac OS comes with Vim 7.3. Download Vim 7.4 source.
Idea from http://stackoverflow.com/questions/7211820/update-built-in-vim-on-mac-os-x
./configure --prefix=/opt/localmake
make
Now if you are on a Beta version of Mac, there will be an error because compiler doesn't recognize Mac version. Head over to Patch and download Vim-sigalsstack.patch

Copy this file to the vim installation folder. Do patch -p1 < vim-sigalstack.patch
according to this Patch.

Now do:
make
sudo make install

And add to bash_profile by typing:
echo 'PATH=/opt/local/bin:$PATH' >> ~/.bash_profile
Now the system vim is still there, but we are using our new vim.

2. Follow this guide Vim - R.

Should be good now.

Sunday, September 28, 2014

Regularization and Beyesian Interpretation

For a linear model y = Xb + e.

1. Minimizing MSE || y - Xb ||^2 is equivalent to maximum likelihood with no distribution on beta, normal distribution on e. Since normal distribution is just exp(-0.5 * x^2). That x^2 is equivalent to ||y - Xb||^2.

2. Minimizing Mean Absolute Deviation |y-Xb| is equivalent to maximum likelihood with no distribution on beta, Laplace distribution on e. Since Laplace distribution is just exp(-|x|).

3. Minimizing MSE with L2 regularization is equivalent to maximum likelihood with normal distribution on both b and e. Since MSE ~ normal e, and L2 ~ normal b.

4. Minimizing MSE with L1 regularization is equivalent to maximum likelihood with normal distribution on e, and Laplace distribution on b.

5. Minimizing MSE with L0 loss is equivalent to maximum likelihood with normal distribution e, b has a distribution of something like super long tail, super center at zero???

6. Maximum Likelihood with normal e, Cauchy prior for b is kind of like minimizing MSE with L0.00001 on beta. The L0.000001 is to denote that instead of regularization |b| for L1, |b|^2 for L2, it is log(|b|+1)