Analysis of correlated series with the R clustcorr package

Khristian Kotov

Introduction

Here I describe a project that spun off from the recent Winton Machine Learning competition. I am not concerned with outlining my solution or covering other details of this rather difficult competition. My main goal is to describe a new R package that solves the problem of finding correlated series on a massive scale. There are other packages on the market, such as amap, but I found that they could not cope with the size of this problem.

With my R package, the resources sitting on my lap were just about sufficient for the exercise: the 8 GB of RAM of my 4-core laptop processing 40K events (the upper-triangular array of pairwise correlations for 40K series alone holds roughly \(8\times10^{8}\) values, several gigabytes if stored in double precision). Unfortunately, the quadratic running time and, more importantly for R, the quadratic memory footprint pose a conceptual limit on scalability: an input twice as large would overwhelm any machine within my reach.

Nonetheless, this R package can be useful for any data analysis that requires finding and grouping together correlated events comprising series of measurements (e.g. time series) for a dataset of up to \(\mathcal{O}(50\mathrm{K})\) events.

Problem and solution

Formulation of the problem:

  • within a single dataset group together all correlated series of measurements

Solution:

  • find all pairwise correlations and store them in an \(\mathcal{O}(N^2)\)-sized array
  • sort the array and iterate over the top correlated elements above a predefined threshold
  • group the pairs of original series corresponding to the array elements selected above (a minimal base-R sketch of this procedure follows the list)
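
For illustration only, here is a minimal base-R sketch of the same procedure. This is not the package's implementation (clustcorr does the heavy lifting in C++), and the function name naive.cluster is made up for this example:

naive.cluster <- function(series, threshold) {
  corr  <- cor(t(series))                               # N x N matrix of pairwise correlations
  pairs <- which(upper.tri(corr) & corr > threshold, arr.ind = TRUE)
  pairs <- pairs[order(corr[pairs], decreasing = TRUE), , drop = FALSE]  # top correlations first
  group <- seq_len(nrow(series))                        # every series starts in its own group
  find  <- function(i) { while (group[i] != i) i <- group[i]; i }
  for (k in seq_len(nrow(pairs))) {                     # merge groups linked by a selected pair
    a <- find(pairs[k, 1]); b <- find(pairs[k, 2])
    if (a != b) group[b] <- a
  }
  split(seq_len(nrow(series)), vapply(seq_len(nrow(series)), find, integer(1)))
}

Even this brute-force version shows where the \(\mathcal{O}(N^2)\) memory goes: the full correlation matrix has to be materialized before anything can be sorted or grouped.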

Implementation in R:

  • Installable from github, provided you have devtools, a compiler, and the C++11 standard library:
require(devtools)
install_github("koskot77/clustcorr")

Example with random series

require(clustcorr)
sample <- matrix( rnorm(1000000), ncol=100 )
cl <- cluster.correlations(sample,0.5)
length(cl)
[1] 9996
cl <- recluster.correlations(sample,0.3)
length(cl)
[1] 1

That is, a correlation threshold of 0.3 is low enough to merge everything into a single cluster, whereas at 0.5 almost every one of the 10000 random series remains in its own cluster.
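
To get a feel for how the threshold controls the granularity, one can simply rerun the clustering at a few cutoffs. This recomputes the correlations each time, so it is wasteful but only relies on the cluster.correlations call shown above:

for (th in c(0.3, 0.4, 0.5)) {
  cat("threshold", th, "->", length(cluster.correlations(sample, th)), "clusters\n")
}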

Example with the competition's data

Missing values are imputed with impute.R, and clustering is done at a 0.9 threshold:

source("loadData.R")
cl <- cluster.correlations(train_rets[1:10000,],0.9)
plot ( t( train_rets[ cl[[1]][1], ] ), type="l", xlab="time", ylab="return")
lines( t( train_rets[ cl[[1]][2], ] ), col="red" )

Figure: two series from the first cluster overlaid as return vs. time.
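
A quick way to inspect the result is to look at the cluster sizes and overlay a few more members of the first cluster; this is plain base R on top of the returned list, nothing package-specific:

sizes <- sapply(cl, length)                        # number of series in each cluster
head( sort(sizes, decreasing = TRUE) )             # the largest clusters
for (i in seq_len( min(5, length(cl[[1]])) ))      # add up to five members of the first cluster
  lines( t( train_rets[ cl[[1]][i], ] ), col = i )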

Summary

The new clustcorr R package is now available for solving the problem of finding similar series of observations in relatively large datasets. Although with a bit of extra effort I could make it handle even larger problem sizes, the intrinsically quadratic running time makes such scalability of little practical use.

Good links: