Thursday, November 6, 2014

Correlation for Classification

In regression, correlation is the one measure to rule them all. For a single input fitted by least squares, it is equivalent to R-squared (corr^2) and to MSE (by equivalent I mean a 1-1 correspondence, for a given y, with the squared correlation).
One can pick input x1 over input x2 by comparing their two correlations with the output y. The exception is if we want an L1 correlation, but that is rarely needed.
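To make the equivalence concrete, here is a small sketch on simulated data. The coefficients, noise level, and variable names are made up for illustration, and it assumes a single-input least-squares fit with an intercept.

```r
# Simulated data: the coefficients and noise level are arbitrary choices.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 * x1 + 0.5 * x2 + rnorm(n)

fit <- lm(y ~ x1)
r   <- cor(x1, y)

# R-squared of the one-input fit equals the squared correlation.
r2 <- summary(fit)$r.squared

# In-sample MSE is a decreasing function of r^2 for a fixed y:
# MSE = mean((y - mean(y))^2) * (1 - r^2)
mse        <- mean(residuals(fit)^2)
mse_from_r <- mean((y - mean(y))^2) * (1 - r^2)

# So ranking inputs by |correlation| matches ranking by single-input MSE.
mse_x2 <- mean(residuals(lm(y ~ x2))^2)
```

Here the input with the larger absolute correlation with y is also the one with the smaller single-input MSE, which is all the 1-1 correspondence claims.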

Things are messier in classification. There is no single measure of fit between two variables.
1. For discrete vs. discrete, one can choose among a few measures, e.g. accuracy, which is defined as the sum of the diagonal terms of the joint probability matrix. One can also use a measure designed for comparing two clusterings. I will have to talk more about this later.
2. For discrete vs. continuous, the area under the ROC curve (AUC) is very good. But please don't compute the area by Riemann integration that re-evaluates the ROC point at every threshold, as that would be O(n^2). After sorting by prediction, a single pass is enough. One can use:
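auc <- function(truth, preds)
{
  # truth: 0/1 labels; preds: scores. Ties in preds are not handled specially.
  r <- as.numeric(truth[order(preds)])  # labels sorted by ascending score
  n.pos <- sum(r); n.neg <- length(r) - n.pos
  # For each negative, count the positives ranked above it, then normalize.
  sum(n.pos - cumsum(r)[r == 0]) / n.pos / n.neg
}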
Or use the function in glmnet: auc.
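The cumulative-sum function above does not treat tied predictions specially. A possible tie-aware variant, sketched here under the assumption that a tied positive/negative pair should count as half, uses midranks (the Mann-Whitney U statistic). The names auc_rank and auc_pairs are hypothetical helpers, and the O(n^2) pairwise version exists only as a check.

```r
# Hypothetical helper: AUC from midranks (Mann-Whitney U statistic).
auc_rank <- function(truth, preds) {
  truth <- as.logical(truth)
  n_pos <- sum(truth)
  n_neg <- sum(!truth)
  ranks <- rank(preds)  # tied scores receive their average rank
  (sum(ranks[truth]) - n_pos * (n_pos + 1) / 2) / (as.numeric(n_pos) * n_neg)
}

# Brute-force O(n^2) reference over all positive/negative pairs,
# used only to check the rank-based result.
auc_pairs <- function(truth, preds) {
  truth <- as.logical(truth)
  d <- outer(preds[truth], preds[!truth], "-")
  mean((d > 0) + 0.5 * (d == 0))
}

# Simulated labels and scores; rounding creates ties on purpose.
set.seed(42)
truth <- rbinom(500, 1, 0.3)
preds <- round(truth + rnorm(500), 1)
```

On this simulated data auc_rank(truth, preds) and auc_pairs(truth, preds) should agree up to floating-point error, while the rank version only needs a sort.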

Other possible metrics: mutual information, KL divergence, or log-likelihood-based measures. I need to read more.
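As a rough sketch of the discrete vs. discrete case, both accuracy and mutual information can be read off the same joint probability matrix. The two label vectors below are toy data made up for illustration, and mutual information is computed in nats with empty cells skipped.

```r
# Toy labels, made up for illustration.
a <- c("x", "x", "y", "y", "z", "z", "x", "y")
b <- c("x", "x", "y", "z", "z", "z", "x", "x")

# Joint probability matrix over a shared set of levels.
lev   <- sort(union(a, b))
joint <- table(factor(a, lev), factor(b, lev)) / length(a)

# Accuracy: total probability on the diagonal.
accuracy <- sum(diag(joint))

# Mutual information in nats: sum of p(a,b) * log(p(a,b) / (p(a) * p(b))),
# skipping empty cells (0 * log 0 is taken as 0).
indep <- outer(rowSums(joint), colSums(joint))
nz    <- joint > 0
mi    <- sum(joint[nz] * log(joint[nz] / indep[nz]))
```

Unlike accuracy, mutual information does not change if the labels of one variable are permuted, which is one reason it might suit comparing two clusterings.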
