Analysis of correlated series with the R clustcorr package

Khristian Kotov

Introduction

Here I describe a project that spun off from the recent Winton Machine Learning competition. I am not concerned with outlining my solution or covering other details of this rather difficult competition. My main goal is to describe a new R package that solves the problem of finding correlated series on a massive scale. There are other packages on the market, such as amap, but I found that they could not cope with the size of this problem.

With my R package, the resources sitting on my lap were just about sufficient for the exercise: the 8 GB of RAM of my 4-core laptop processing 40K events (the upper-triangular array of pairwise correlations for 40K series alone holds roughly \(8\times10^{8}\) values, several gigabytes if stored in double precision). Unfortunately, the quadratic running time and, more importantly for R, the quadratic memory footprint pose a conceptual limit on scalability: an input twice as large would overwhelm any machine within my reach.

Nonetheless, this R package can be useful for any data analysis that requires finding and grouping together correlated events comprising series of measurements (e.g. time series) for a dataset of up to \(\mathcal{O}(50\mathrm{K})\) events.

Problem and solution

Formulation of the problem:

  • within a single dataset group together all correlated series of measurements

Solution:

  • find all pairwise correlations and store them in an \(\mathcal{O}(N^2)\)-sized array
  • sort the array and iterate over the top correlated elements above a predefined threshold
  • group the pairs of original series corresponding to the array elements selected above (a minimal base-R sketch of this procedure follows the list)
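
For illustration only, here is a minimal base-R sketch of the same procedure. This is not the package's implementation (clustcorr does the heavy lifting in C++), and the function name naive.cluster is made up for this example:

naive.cluster <- function(series, threshold) {
  corr  <- cor(t(series))                               # N x N matrix of pairwise correlations
  pairs <- which(upper.tri(corr) & corr > threshold, arr.ind = TRUE)
  pairs <- pairs[order(corr[pairs], decreasing = TRUE), , drop = FALSE]  # top correlations first
  group <- seq_len(nrow(series))                        # every series starts in its own group
  find  <- function(i) { while (group[i] != i) i <- group[i]; i }
  for (k in seq_len(nrow(pairs))) {                     # merge groups linked by a selected pair
    a <- find(pairs[k, 1]); b <- find(pairs[k, 2])
    if (a != b) group[b] <- a
  }
  split(seq_len(nrow(series)), vapply(seq_len(nrow(series)), find, integer(1)))
}

Even this brute-force version shows where the \(\mathcal{O}(N^2)\) memory goes: the full correlation matrix has to be materialized before anything can be sorted or grouped.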

Implementation in R:

  • Installable from github, provided you have devtools, a compiler, and the C++11 standard library:
require(devtools)
install_github("koskot77/clustcorr")

Example with random series

require(clustcorr)
sample <- matrix( rnorm(1000000), ncol=100 )
cl <- cluster.correlations(sample,0.5)
length(cl)
[1] 9996
cl <- recluster.correlations(sample,0.3)
length(cl)
[1] 1

That is, a correlation threshold of 0.3 is low enough to merge everything into a single cluster, whereas at 0.5 almost every one of the 10000 random series remains in its own cluster.
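
To get a feel for how the threshold controls the granularity, one can simply rerun the clustering at a few cutoffs. This recomputes the correlations each time, so it is wasteful but only relies on the cluster.correlations call shown above:

for (th in c(0.3, 0.4, 0.5)) {
  cat("threshold", th, "->", length(cluster.correlations(sample, th)), "clusters\n")
}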

Example with the competition's data

Missing values are imputed with impute.R, and clustering is done at a 0.9 threshold:

source("loadData.R")
cl <- cluster.correlations(train_rets[1:10000,],0.9)
plot ( t( train_rets[ cl[[1]][1], ] ), type="l", xlab="time", ylab="return")
lines( t( train_rets[ cl[[1]][2], ] ), col="red" )

Figure: two series from the first cluster overlaid as return vs. time.
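
A quick way to inspect the result is to look at the cluster sizes and overlay a few more members of the first cluster; this is plain base R on top of the returned list, nothing package-specific:

sizes <- sapply(cl, length)                        # number of series in each cluster
head( sort(sizes, decreasing = TRUE) )             # the largest clusters
for (i in seq_len( min(5, length(cl[[1]])) ))      # add up to five members of the first cluster
  lines( t( train_rets[ cl[[1]][i], ] ), col = i )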

Summary

The new clustcorr R package is now available for solving the problem of finding similar series of observations in relatively large datasets. Although with a bit of extra effort I could make it handle even larger problem sizes, the intrinsically quadratic running time makes such scalability of little practical use.

Good links: