How do I force clustering of data in a specific evident pattern?


I have a large set of 'Vehicle speed vs Engine RPM' values for a vehicle. I'm trying to predict the time spent by the vehicle in each gear.

I ran K-Means clustering on the dataset and got the following result:

Clearly, my algorithm has failed to capture the evident pattern. I want to force K-Means (or any other clustering algorithm, for that matter) to cluster data along the six sloped lines. Snippet of relevant code:

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans

plt.rcParams['figure.figsize'] = (16, 9)
plt.style.use('ggplot')

# Importing the dataset
data = pd.read_csv('speedRpm.csv')

# Getting the data points
f1 = data['rpm'].values
f2 = data['speed'].values
X = np.array(list(zip(f1, f2)))

# Number of clusters
k = 5

kmeans = KMeans(n_clusters=k)
# Fitting the input data
kmeans = kmeans.fit(X)
# Getting the cluster labels
labels = kmeans.predict(X)
# Centroid values
centroids = kmeans.cluster_centers_

# Points grouped by cluster label (an equivalent view of the labels below)
labeled_array = {i: X[np.where(kmeans.labels_ == i)] for i in range(kmeans.n_clusters)}

colors = ['r', 'g', 'b', 'y', 'c']
fig, ax = plt.subplots()
for i in range(k):
        points = np.array([X[j] for j in range(len(X)) if kmeans.labels_[j] == i])
        ax.scatter(points[:, 0], points[:, 1], s=7, c=colors[i])
ax.scatter(centroids[:, 0], centroids[:, 1], marker='*', s=200, c='#050505')

How do I make sure the clustering algorithm captures the right pattern, even though it possibly isn't the most efficient?



Ran the same set of points using DBSCAN this time. After playing around with the eps and min_samples values for some time, I got the following result:

Although still not perfect, with way too many outliers, the algorithm is beginning to capture the linear trend.


import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.cluster import DBSCAN

plt.rcParams['figure.figsize'] = (16, 9)
plt.style.use('ggplot')

# Importing the dataset
data = pd.read_csv('speedRpm.csv')

# Getting the values and plotting it
f1 = data['rpm'].values
f2 = data['speed'].values
X = np.array(list(zip(f1, f2)))


# Compute DBSCAN
db = DBSCAN(eps=1.1, min_samples=3).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print("Estimated Number of Clusters:", n_clusters_)

# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]

    class_member_mask = (labels == k)

    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)

    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
High Level

There are two major options here:

  1. Transform your data so that k-means-style clustering algorithms succeed
  2. Pick a different algorithm

Minor option:

  1. Tweak kmeans by forcing the initialization to be smarter
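On the minor option: scikit-learn's `KMeans` accepts an explicit array for `init`, so you can seed the centroids by hand near where you expect them to land. A minimal sketch on hypothetical toy data (the blob locations and seed coordinates are made up for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two well-separated blobs
rng = np.random.default_rng(0)
blob_a = rng.normal([0.0, 0.0], 0.5, size=(100, 2))
blob_b = rng.normal([5.0, 5.0], 0.5, size=(100, 2))
X = np.vstack([blob_a, blob_b])

# Hand-picked seeds near where you expect the centroids to be
seeds = np.array([[0.0, 0.0], [5.0, 5.0]])

# n_init=1 because explicit seeds make random restarts unnecessary
kmeans = KMeans(n_clusters=2, init=seeds, n_init=1).fit(X)
```

Smart seeding only helps when the clusters are separable under Euclidean distance in the first place, which is exactly what's in question here.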

Option 2

The scikit-learn documentation has a good comparison of several clustering algorithms here. From the link, a (crudely cropped) helpful graphic:

This row looks similar to your dataset; have you tried a Gaussian mixture model? A GMM has a few well-known theoretical properties, and it works by assigning each point a probability of belonging to each cluster center, based on a posterior calculated from the data. You can often initialize it with k-means, which scikit-learn does for you.
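A sketch of that suggestion on hypothetical data (two elongated linear trends stand in for your gear lines; the synthetic values are assumptions, not your CSV):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical data: two elongated, overlapping linear trends through the origin
rng = np.random.default_rng(0)
rpm = rng.uniform(1000, 5000, size=300)
speed = np.where(rng.random(300) < 0.5, rpm * 0.01, rpm * 0.02)
X = np.column_stack([rpm, speed])

# Full covariance lets each component stretch along its own line;
# the default init_params='kmeans' seeds the components with k-means
gmm = GaussianMixture(n_components=2, covariance_type='full',
                      random_state=0).fit(X)
labels = gmm.predict(X)
probs = gmm.predict_proba(X)   # soft assignments, one column per component
```

The soft assignments in `predict_proba` are useful near (0, 0), where the gear lines overlap and a hard label is least trustworthy.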

Similarly, density-based clustering algorithms (e.g., DBSCAN) seem like a logical choice. Your data has a nice segmentation into dense clusters, and that seems like a good topological property to filter for. In the image on the linked Wikipedia page:

they offer the caption:

DBSCAN can find non-linearly separable clusters. This dataset cannot be adequately clustered with k-means

which seems to speak to your troubles.

More on your troubles

K-means is an extremely versatile algorithm, but it is not globally optimal and has a number of weak points. Here is some dense reading on them.

In addition to problems like the mickey mouse problem, k-means is typically minimizing simple Euclidean distance to the centroids. While this makes sense for a lot of problems, it doesn't make sense in yours, where the skew of the clusters means that isn't quite the correct measure. Notice that other algorithms shown above that use similar measures, like agglomerative/hierarchical clustering, have similar pitfalls.

I haven't covered transforming your data or tweaking k-means. The latter requires actually hacking into (or writing your own) clustering algorithm, which I don't recommend for a simple exploratory problem given the coverage of sklearn and similar packages, while the former seems like a local solution sensitive to your exact data. ICA might be a decent start, but there are a lot of options for that task.


k-means (and the other clustering algorithms quoted in @en_Knight's answer) are meant for multi-dimensional data that tends to have groups of data points that are 'close' to each other (in terms of Euclidean distance) but spatially separated.

In your case, if the data is considered in your unprocessed input space (rpm vs. velocity), the 'clusters' that form are very elongated and largely overlap in the region near (0, 0), so most if not all methods based on Euclidean distance are bound to fail.

Your data isn't really 6 groups of 2-dimensional points that are spatially separated. Instead, it is actually a mix of 6 possible linear trends.

Therefore, the grouping should be based on x/y (the gear ratio). It is 1-dimensional: each (rpm,velocity) pair corresponds to a single (rpm/velocity) value and you want to group those.

I don't know whether k-means (or other algorithms) can take a 1-D data set, but if not, you can create a new array of pairs like [0, rpm/vel] and run that through it.

You may want to look for a 1-D algorithm that's more efficient than the multi-dimensional generic ones.

This will make the graph labeling a bit more involved, because the grouping is computed on a derived data set with a different shape (1 x samples) than the original (2 x samples), but mapping between them isn't difficult.
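The ratio-based approach above can be sketched like this. The gear ratios and noise level are invented for the example (your CSV would supply the real rpm/speed columns), and the 1-D values are reshaped to a column because scikit-learn expects a 2-D array:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: three gear ratios, with noise on the speed reading
rng = np.random.default_rng(0)
rpm = rng.uniform(1000, 5000, size=300)
ratios = rng.choice([0.010, 0.015, 0.025], size=300)
speed = rpm * ratios + rng.normal(0, 0.5, size=300)

# Cluster the 1-D speed/rpm ratio; reshape to (n_samples, 1) for sklearn
r = (speed / rpm).reshape(-1, 1)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(r)

# The labels index the original (rpm, speed) points directly,
# so mapping back for plotting is just kmeans.labels_
centers = np.sort(kmeans.cluster_centers_.ravel())
```

Because the ratio clusters are tight and well separated, the recovered centers land very close to the true gear ratios, which is exactly why the 1-D formulation is so much easier than the raw (rpm, speed) one.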


You could multiply your y-values by a factor of 10 or more so they spread out along that axis. Just make sure you keep track of whether you're working with the real values or the scaled ones.
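A minimal sketch of that rescaling, on made-up data (the factor of 10 is the one suggested above; in practice you'd tune it to your value ranges):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical (rpm, speed) points; speed is much smaller in magnitude
rng = np.random.default_rng(0)
rpm = rng.uniform(1000, 5000, size=200)
speed = rpm * rng.choice([0.01, 0.02], size=200)

SCALE = 10.0  # the suggested stretch factor for the y-axis
X_scaled = np.column_stack([rpm, speed * SCALE])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)

# Convert centroids back to real units before plotting or reporting
centroids = kmeans.cluster_centers_.copy()
centroids[:, 1] /= SCALE
```

Keeping the scaling in a single named constant makes it easy to stay consistent between the clustering space and the real-valued plots.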





  • At least scale your data so that the stretched blobs look more like circles. With the right seeds, KMeans or another clustering algorithm, like a mixture of Gaussians, should perform better.
  • Another approach would be to use RANSAC or another robust regression algorithm to fit one blob with a line, then remove data around this line and iterate.
  • How did you end up with what is clearly a mix of 6 linear trends, while the vehicle under examination likely has a 5-speed transmission, judging by your code example (it sets k=5)?
  • @LeoK k=5 means that the clustering algorithm will generate 5 centres, which it appears to do from his image (the markers and colors). The "6 linear trends" you mention come from the underlying data. There doesn't seem to be a discrepancy there to me
  • What I was asking is whether the data is from a 6-speed car, as it seems to be (while you're trying to group it into 5 groups, which can't give a correct grouping, even if there weren't other problems)
  • Thanks for the detailed answer @en_Knight! Ran it with DBSCAN and got better results
  • k-means and any other clustering algorithm I'm familiar with can work in 1-D. Euclidean distance is perfectly well defined along a number line
  • A few other nitpicks :) Not all the algorithms I "quote" group by Euclidean distance. Spectral clustering, for example, is doing something else (it's more like a kernelized k-means, but that still isn't L2 distance). "Almost all methods based on Euclidean distance are bound to fail" - does this include density clustering? Why so?
  • I'm not familiar with ALL of the algorithms, but many of them use L2 distance or an imitation thereof, and are meant for data in which the 'dimensions' mean the same thing (e.g., coordinates in physical space); here you have rpm and velocity, which aren't 'comparable' to each other. Using the knowledge that (x/y) is the thing that's expected to fall into clusters seems to make sense. Density clustering might work, though with the noisy data shown it will still make mistakes - your sample is less noisy than the OP's and it made visible mistakes even on it. I'd convert to 1-D in all cases.
  • @en_Knight: in any case, I wasn't making claims that your suggested alternatives are 'bad', any one of them could work on the 2-D input set. Nor that I know how all of them work. All I'm saying is that it is clearly a 'difficult' one when looked at as an (x,y) set, but will become easy and neatly split without errors if you use the simple transformation that I suggested.
  • makes sense to me, +1 :)
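The RANSAC-and-remove idea from the comments above could be sketched like this: fit one line robustly, peel off its inliers, and repeat once per expected gear. The data, `residual_threshold`, and iteration count are assumptions to tune, not values from the question:

```python
import numpy as np
from sklearn.linear_model import RANSACRegressor

# Hypothetical data: two noisy linear trends through the origin
rng = np.random.default_rng(0)
rpm = rng.uniform(1000, 5000, size=400)
speed = rpm * rng.choice([0.01, 0.02], size=400) + rng.normal(0, 1, size=400)

remaining_rpm, remaining_speed = rpm, speed
lines = []
for _ in range(2):  # one pass per expected gear
    ransac = RANSACRegressor(residual_threshold=5.0, random_state=0)
    ransac.fit(remaining_rpm.reshape(-1, 1), remaining_speed)
    inliers = ransac.inlier_mask_
    lines.append((remaining_rpm[inliers], remaining_speed[inliers]))
    # Peel off the inliers and refit on what is left
    remaining_rpm = remaining_rpm[~inliers]
    remaining_speed = remaining_speed[~inliers]
```

Because each gear line passes near the origin, a per-line robust fit sidesteps the overlap near (0, 0) that defeats distance-based clustering.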