Tuesday, October 21, 2014

Linear Algebra on R from C++

So one of my end goals is this: given a data matrix of size n*k, where n = 50,000 and k = 400, I want to calculate the Manhattan distance matrix, which has size 50k*50k. I then want to use its first three eigenvectors to do clustering, e.g. k-means.
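
For concreteness, here is the whole pipeline at a small scale in plain R (the sizes are shrunk for illustration; at the full size the 50k*50k distance matrix alone is about 20 GB in double precision):
set.seed(1)
n <- 500; k <- 40                               # small stand-ins for 50,000 and 400
X <- matrix(rnorm(n * k), nrow = n, ncol = k)
D <- as.matrix(dist(X, method = "manhattan"))   # n x n Manhattan distance matrix
ev <- eigen(D, symmetric = TRUE)
cl <- kmeans(ev$vectors[, 1:3], centers = 3)    # k-means on the first three eigenvectors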

In R, I use the dist function. It is very fast but uses only one core, and I have no idea how to make it use multiple cores. For the eigenvectors, there is the irlba package, which computes the first few of them pretty fast.
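
For example, something like this (on the small D from the sketch above) gives the leading vectors without doing a full decomposition; for a symmetric matrix the leading singular vectors correspond, up to sign, to the eigenvectors with the largest absolute eigenvalues:
library(irlba)
fit <- irlba(D, nv = 3)   # truncated SVD, 3 singular vectors
fit$d                     # leading singular values
fit$v                     # corresponding singular vectors, n x 3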

I want to use C++ for this job. Now, for both R and C++, the backend linear algebra is done by BLAS (the basic operations such as addition and multiplication) and LAPACK (the higher-level routines such as SVD, QR, Cholesky, etc.). Performance of the basic linear algebra shouldn't differ much between R and C++, since R's linear algebra functions are themselves thin wrappers around the same compiled BLAS/LAPACK routines.

In C++ there are two competing linear algebra libraries, namely Armadillo and Eigen. Eigen implements its own routines instead of calling BLAS and LAPACK, and the consensus seems to be that Eigen is slightly faster than Armadillo. Armadillo is basically a wrapper around BLAS and LAPACK, which have very ugly syntax on their own.

1. How to use Armadillo in R. First install RcppArmadillo in R, then write this in C++ (saved here as EigenARMA.cpp):
#include <RcppArmadillo.h>

// [[Rcpp::depends(RcppArmadillo)]]

// [[Rcpp::export]]
arma::vec getEigenValuesARMA(arma::mat M) {
    return arma::eig_sym(M);   // all eigenvalues of a symmetric matrix, via LAPACK
}
Note that both comment attributes are needed for the function to work. The first, Rcpp::depends, tells Rcpp that this file depends on RcppArmadillo, so that sourceCpp() sets the right compiler and linker flags. The second, Rcpp::export, makes the function available in the R environment.
On R write this:
library('Rcpp')
library('microbenchmark')
working.directory <- PATH # Fill in your path here
sourceCpp(file.path(working.directory, 'EigenARMA.cpp'))
n = 2e3
M = matrix(rnorm(n^2),nrow = n, ncol = n)
M = t(M)%*%M   # make M symmetric, so the symmetric solvers apply
print(system.time(eigen(M)))
print(system.time(getEigenValuesARMA(M)))
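
Since microbenchmark is loaded anyway, a less noisy comparison might look something like this (only.values = TRUE makes the R call comparable, as the C++ function returns only the eigenvalues):
microbenchmark(
  R_eigen = eigen(M, symmetric = TRUE, only.values = TRUE),
  ARMA    = getEigenValuesARMA(M),
  times   = 5L
)
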
2. How to use Eigen in R. Again, first install RcppEigen in R, then write this in C++ (saved here as EigenEG.cpp):
#include <RcppEigen.h>

// [[Rcpp::depends(RcppEigen)]]

using Eigen::Map;                       // 'maps' rather than copies
using Eigen::MatrixXd;                  // variable size matrix, double precision
using Eigen::VectorXd;                  // variable size vector, double precision
using Eigen::SelfAdjointEigenSolver;    // one of the eigenvalue solvers

// [[Rcpp::export]]
VectorXd getEigenValuesEG(Map<MatrixXd> M) {
    SelfAdjointEigenSolver<MatrixXd> es(M);  // eigen-decomposition for symmetric (self-adjoint) matrices
    return es.eigenvalues();
}
In R write this:
library('Rcpp')
library('microbenchmark')
working.directory <- PATH # Fill in your path here
sourceCpp(file.path(working.directory, 'EigenEG.cpp'))
n = 2e3
M = matrix(rnorm(n^2),nrow = n, ncol = n)
M = t(M)%*%M
print(system.time(eigen(M)))
print(system.time(getEigenValuesEG(M)))
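
As a quick sanity check (assuming both .cpp files have been sourced in the same session), all three should agree up to ordering and floating-point error; the back ends may order the eigenvalues differently, so sort before comparing:
ev_R    <- sort(eigen(M, symmetric = TRUE, only.values = TRUE)$values)
ev_ARMA <- sort(as.numeric(getEigenValuesARMA(M)))
ev_EG   <- sort(as.numeric(getEigenValuesEG(M)))
all.equal(ev_R, ev_ARMA)
all.equal(ev_R, ev_EG)
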
References:
RcppEigen Example
RcppArmadillo Example

3. Now comparing the performance: calculating the eigenvalues of a 2000*2000 matrix.
Arma: 3.459s
Eigen: 46.625s
R's built-in eigen: 12.547s
But the difference here is mostly because the Eigen version uses only one core, while both the Armadillo version and R's built-in function use multiple cores through a multithreaded BLAS. To set the number of cores BLAS uses in R, insert this at the beginning of the script:
library(RhpcBLASctl)
blas_set_num_threads(1)  # 1 means use a single core
The times now are:
Arma: 4.319s
Eigen: 47.635s
R's built-in eigen: 8.434s
Everything uses one core now, but the comparison is still not quite fair: arma::eig_sym assumes a symmetric matrix, and both C++ functions return only the eigenvalues, whereas R's eigen() computes the eigenvectors as well by default.
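
For reference, R's eigen() can be put on roughly the same footing by declaring the matrix symmetric and skipping the eigenvectors:
print(system.time(eigen(M, symmetric = TRUE, only.values = TRUE)))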

4. Other discussion:
It seems the rARPACK package provides a direct interface to ARPACK from R, which computes only the leading few eigenvalues/eigenvectors; that fits the goal of extracting just the first three. Will check this out.
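
If the interface is what I expect, usage would be roughly this (untested; eigs_sym and its arguments here are my guess at the rARPACK API):
library(rARPACK)
res <- eigs_sym(M, k = 3, which = "LM")   # 3 eigenvalues of largest magnitude
res$values
res$vectors                               # should be a 2000 x 3 matrix of eigenvectors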
