Clustering of synthetic control data in R

This is an R implementation of the clustering example provided with Mahout. The original problem description is:

A time series of control charts needs to be clustered into their close knit groups. The data set we use is synthetic and so resembles real world information in an anonymized format. It contains six different classes (Normal, Cyclic, Increasing trend, Decreasing trend, Upward shift, Downward shift). With these trends occurring on the input data set, the Mahout clustering algorithm will cluster the data into their corresponding class buckets. At the end of this example, you’ll get to learn how to perform clustering using Mahout.

We will be doing the same but using R instead of Mahout. The input dataset is available here.

To run this example you need R and the flexclust package, which is available from CRAN. It provides a number of methods for clustering and cluster visualization.
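
A minimal sketch of the installation step, assuming a CRAN mirror is configured for your R session. It only installs the package when it is not already present, so it is safe to leave at the top of a script:

```r
# Install flexclust from CRAN only if it is not already available.
if (!requireNamespace("flexclust", quietly = TRUE)) {
  install.packages("flexclust")
}
```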

Here is the script:

# read the input dataset
x <- read.table("synthetic_control.data")
cat("Read", nrow(x), "records.\n")

# load clustering library
library(flexclust)

# get number of clusters from user
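n <- as.integer(readline("Enter number of clusters: "))
# readline() returns "" when R is not interactive, which becomes NA here
if (is.na(n) || n < 1) stop("The number of clusters must be a positive integer.")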

# run kmeans clustering on the dataset
cl1 <- cclust(x, n)

print("clustering complete")

# show summary of clustering
summary(cl1)

# plot the clusters
plot(cl1, main="Clusters")

readline("Press enter for cluster histogram")
m <- info(cl1, "size") # size of each cluster
# repeat each cluster number once per member so hist() draws one bar per cluster
hist(rep(1:n, m), breaks=0:n, xlab="Cluster No.", main="Cluster Plot")

readline("Press enter for a plot of the distance of data points from their cluster centroid")
stripes(cl1)
print("done")
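
The problem description says the clusters should line up with the six classes. One way to check that kind of claim is to cross-tabulate cluster assignments against known class labels. The sketch below is not part of the original script and makes several assumptions: it simulates a small data set instead of reading synthetic_control.data (the source does not say how the rows of that file are labelled), the six generator formulas are illustrative only, and it uses base R's kmeans() rather than flexclust so that it runs without extra packages.

```r
# Sketch: simulate a few series per class, cluster them with base-R kmeans(),
# and cross-tabulate the clusters against the known classes.
set.seed(42)                      # reproducible simulation and k-means starts

n_per_class <- 20                 # assumed size, chosen only for this demo
n_points    <- 60                 # assumed length of each series
time_idx    <- seq_len(n_points)

# One generator per class. These formulas are illustrative assumptions,
# not the ones used to build synthetic_control.data.
make_series <- function(class) {
  noise <- rnorm(n_points, sd = 2)
  switch(class,
    normal     = 30 + noise,
    cyclic     = 30 + noise + 10 * sin(2 * pi * time_idx / 12),
    increasing = 30 + noise + 0.4 * time_idx,
    decreasing = 30 + noise - 0.4 * time_idx,
    up_shift   = 30 + noise + 12 * (time_idx > 30),
    down_shift = 30 + noise - 12 * (time_idx > 30)
  )
}

classes <- c("normal", "cyclic", "increasing",
             "decreasing", "up_shift", "down_shift")
labels  <- rep(classes, each = n_per_class)

# Rows are series, columns are time points.
sim <- t(sapply(labels, make_series))

# k-means with one cluster per class and several random starts.
fit <- kmeans(sim, centers = length(classes), nstart = 10)

# Rows: true class, columns: assigned cluster.
confusion <- table(class = labels, cluster = fit$cluster)
print(confusion)
```

A class that is recovered cleanly shows up as a row with all of its counts in a single column. Rows that are spread over several columns, or columns shared by several rows, show which classes the clustering tends to confuse. With the flexclust fit from the script above, the cluster assignments should be available through clusters(cl1), so the same table() call can be used if class labels for the real data are known.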

Here are the graphs produced when the above script is run with n = 7 clusters:

Clusters

[Figure: clusters]

Frequency Histogram

[Figure: frequency]

Distance from centroid

[Figure: centroid distance]