Here I describe a project that spun off from a recent
Winton Machine Learning competition.
I am not concerned with outlining the solution or covering other details of this rather difficult
competition. My main goal is to describe a new R package that solves the problem of finding correlated
series on a massive scale. There are other packages on the market, such as amap, but I found
that they could not cope with the size of this problem.
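To give a sense of the baseline, the straightforward route with amap looks roughly like the sketch below; the data layout and sizes here are made up for illustration.

```r
library(amap)

set.seed(1)
x <- matrix(rnorm(1000 * 50), nrow = 1000)  # 1000 events, 50 measurements each

# amap::Dist computes all pairwise distances, here a correlation-based one,
# optionally on several cores; at 40K events the O(n^2) result no longer
# fits comfortably in memory.
d <- Dist(x, method = "correlation", nbproc = 4)
```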
With my R package, the resources sitting on my lap (8 GB of RAM on a 4-core laptop) were just about sufficient for the exercise of processing 40K events. Unfortunately, the quadratic running time and, more importantly for R, the quadratic memory footprint place a conceptual limit on scalability: an input twice as large would render useless any machine within my reach.
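A back-of-the-envelope calculation makes that footprint concrete, assuming the pairwise results are stored as 8-byte doubles:

```r
# n*(n-1)/2 doubles for a lower-triangle dist object, n^2 for a full matrix
pairwise_gb <- function(n, full = FALSE) {
  cells <- if (full) n^2 else n * (n - 1) / 2
  cells * 8 / 1024^3
}
pairwise_gb(40000)               # ~6 GB:  just about fits in 8 GB of RAM
pairwise_gb(40000, full = TRUE)  # ~12 GB: a full matrix already does not
pairwise_gb(80000)               # ~24 GB: doubling the input quadruples the footprint
```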
Nonetheless, this R package can be useful for any data analysis that requires finding and grouping together correlated events, each comprising a series of measurements (e.g. a time series), for datasets of up to \(\mathcal{O}(50\mathrm{K})\) events.
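As a toy illustration of that kind of analysis, here is the task spelled out in plain base R on synthetic data; this is not the package's own interface, just the computation it is meant to scale up.

```r
set.seed(42)
x <- matrix(rnorm(200 * 30), nrow = 200)  # 200 events, 30 measurements each
# inject a shared pattern so the first 50 events are mutually correlated
x[1:50, ] <- sweep(x[1:50, ], 2, 3 * sin(seq_len(30)), "+")

r  <- cor(t(x))       # event-by-event Pearson correlation matrix, O(n^2) cells
d  <- as.dist(1 - r)  # turn correlation into a dissimilarity
cl <- cutree(hclust(d, method = "average"), h = 0.5)
table(cl)             # the injected block comes out as one large group
```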