clustering very large dataset in R


I have a dataset of 70,000 numeric values representing distances ranging from 0 to 50, and I want to cluster these numbers. However, with the classical clustering approach I would have to build a 70,000 x 70,000 distance matrix holding the distance between every pair of values, and that matrix won't fit in memory. Is there a smart way to solve this problem without resorting to stratified sampling? I also tried the bigmemory and big analytics libraries in R, but I still can't fit the data into memory.

You can use kmeans, which handles this amount of data without trouble, to compute a fairly large number of centers (1000, 2000, ...) and then perform hierarchical clustering on the coordinates of those centers. This way the distance matrix becomes much smaller.

## Example
# Data
x <- rbind(matrix(rnorm(70000, sd = 0.3), ncol = 2),
           matrix(rnorm(70000, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")

# Hierarchical clustering (HCPC) directly on the raw data: doesn't necessarily work (too costly on the full data)
library(FactoMineR)
cah.test <- HCPC(x, graph=FALSE, nb.clust=-1)

# Hierarchical clustering on the k-means centers: works quickly
cl <- kmeans(x, 1000, iter.max=20)
cah <- HCPC(cl$centers, graph=FALSE, nb.clust=-1)
plot.HCPC(cah, choice="tree")
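
If you also need cluster labels for the original observations (not just for the centers), one way is to propagate each center's tree cluster back through the k-means assignments. A minimal sketch, assuming HCPC keeps the centers in their original row order in data.clust:

# Cluster label of each k-means center, taken from the HCPC tree
center.clust <- cah$data.clust$clust
# Label of each original observation, via the center it was assigned to
obs.clust <- center.clust[cl$cluster]
table(obs.clust)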


70000 is not large. It's not small, but it's also not particularly large... The problem is the limited scalability of matrix-oriented approaches.

But there are plenty of clustering algorithms which do not use matrices and do not need O(n^2) (or even worse, O(n^3)) runtime.
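
As a rough back-of-the-envelope check (my numbers, not part of the original answer): a dense 70,000 x 70,000 matrix of doubles needs roughly 39 GB, and even the lower triangle that dist() actually stores is about half of that, so a matrix-based approach simply cannot hold the distances in ordinary RAM.

n <- 70000
n^2 * 8 / 1024^3               # full matrix of doubles: ~36.5 GiB
n * (n - 1) / 2 * 8 / 1024^3   # lower triangle stored by dist(): ~18.2 GiB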

You may want to try ELKI, which has great index support (try the R*-tree with SortTileRecursive bulk loading). The index support makes it a lot faster.

If you insist on using R, at least give kmeans and the fastcluster package a try. K-means has runtime complexity O(n*k*i) (where k is the number of clusters and i is the number of iterations); fastcluster offers an O(n)-memory, O(n^2)-runtime implementation of single-linkage clustering comparable to the SLINK algorithm in ELKI. (The R "agnes" hierarchical clustering needs O(n^3) runtime and O(n^2) memory.)
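
For illustration, a minimal sketch of both suggestions on a 1-D vector like the one in the question (the data here is made up); fastcluster::hclust.vector does single linkage with O(n) memory, so no n x n distance matrix is ever built:

d <- runif(70000, 0, 50)            # stand-in for the 70,000 distance values

# k-means: no distance matrix, runtime O(n*k*i)
km <- kmeans(d, centers = 5, nstart = 10)
table(km$cluster)

# fastcluster: memory-saving single linkage on the raw vector data
library(fastcluster)                # install.packages("fastcluster")
hc <- hclust.vector(matrix(d, ncol = 1), method = "single")
groups <- cutree(hc, k = 5)
table(groups)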

Implementation matters. Often, implementations in R aren't the best IMHO, except for core R, which usually at least has competitive numerical precision. But R was built by statisticians, not by data miners. Its focus is on statistical expressiveness, not on scalability. So the authors aren't to blame; it's just the wrong tool for large data.

Oh, and if your data is 1-dimensional, don't use clustering at all. Use kernel density estimation. 1-dimensional data is special: it's ordered. Any good algorithm for breaking 1-dimensional data into intervals should exploit the fact that the data can be sorted.
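
A minimal sketch of that idea in R (base-R functions only; the rule of cutting at local minima of the density is my own illustration, not something prescribed by the answer):

d <- runif(70000, 0, 50)            # stand-in for the real 1-D values

dens <- density(d)                  # kernel density estimate
# local minima of the estimated density become interval boundaries
mins <- which(diff(sign(diff(dens$y))) > 0) + 1
cutpoints <- dens$x[mins]
cutpoints <- cutpoints[cutpoints > min(d) & cutpoints < max(d)]
breaks <- c(min(d), cutpoints, max(d))

groups <- cut(d, breaks = breaks, include.lowest = TRUE)
table(groups)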


Another non-matrix-oriented approach, at least for visualizing clusters in big data, is the largeVis algorithm by Tang et al. (2016). The largeVis R package has unfortunately been orphaned on CRAN for lack of maintenance, but a (maintained?) version can still be compiled from its GitHub repository (with Rtools installed), e.g.,

library(devtools)     
install_github(repo = "elbamos/largeVis")

A Python version of the package exists as well. The underlying algorithm uses segmentation trees and a neighbourhood refinement step to find the K most similar instances for each observation, and then projects the resulting neighbourhood network into a lower-dimensional space. It is implemented in C++ and uses OpenMP (if supported while compiling) for multi-processing; it has thus been sufficiently fast for clustering any of the larger data sets I have tested so far.
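
A rough usage sketch (the argument names follow the package README as I remember it and may differ between versions, so treat the exact call as an assumption): largeVis() expects observations in the columns of the input matrix, hence the transpose, and returns a low-dimensional embedding that can then be plotted or clustered.

library(largeVis)
# x: observations in rows (as in the first answer), so transpose for largeVis
vis <- largeVis(t(x), dim = 2, K = 50)
coords <- t(vis$coords)             # one row of 2-D coordinates per observation
plot(coords, pch = ".")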





Comments
  • Is this solution (using cluster::clara) relevant/useful?
  • No, not really, because the problem is that the distance matrix will be too large to fit into any memory
  • Using your method, after running cah <- HCPC(cl$centers, graph=FALSE, nb.clust=-1) I get this error: Error in catdes(data.clust, ncol(data.clust), proba = proba, row.w = res.sauv$call$row.w.init) : object 'data.clust' not found