Can I use K-means algorithm on a string?

I am working on a python project where I study RNA structure evolution (represented as a string for example: "(((...)))" where the parenthesis represent basepairs). The point being is that I have an ideal structure and a population that evolves towards the ideal structure. I have implemented everything however I would like to add a feature where I can get the "number of buckets" ie the k most representative structures in the population at each generation.

I was thinking of using the k-means algorithm but I am not sure how to use it with strings. I found scipy.cluster.vq but I don't know how to use it in my case.


K-means doesn't really care about the type of the data involved. All you need to do a K-means is some way to measure a "distance" from one item to another. It'll do its thing based on the distances, regardless of how that happens to be computed from the underlying data.

That said, I haven't used scipy.cluster.vq, so I'm not sure exactly how you tell it the relationship between items, or how to compute a distance from item A to item B.

(PDF) K-strings algorithm, a new approach based on Kmeans, K-strings algorithm, a new approach based on Kmeans. Viet-Hoang Le Using the above three formulas we can standardize features of. connection record in� K-means is a popular clustering algorithm which is widely used in anomaly-based intrusion detection. It tries to classify a given data set into k (a predefined number) categories.

One problem you would face if using scipy.cluster.vq.kmeans is that that function uses Euclidean distance to measure closeness. To shoe-horn your problem into one solveable by k-means clustering, you'd have to find a way to convert your strings into numerical vectors and be able to justify using Euclidean distance as a reasonable measure of closeness.

That seems... difficult. Perhaps you are looking for Levenshtein distance instead?

Note there are variants of the K-means algorithm that can work with non-Euclideance distance metrics (such as Levenshtein distance). K-medoids (aka PAM), for instance, can be applied to data with an arbitrary distance metric.

For example, using Pycluster's implementation of k-medoids, and nltk's implementation of Levenshtein distance,

import nltk.metrics.distance as distance
import Pycluster as PC

words = ['apple', 'Doppler', 'applaud', 'append', 'barker', 
         'baker', 'bismark', 'park', 'stake', 'steak', 'teak', 'sleek']

dist = [distance.edit_distance(words[i], words[j]) 
        for i in range(1, len(words))
        for j in range(0, i)]

labels, error, nfound = PC.kmedoids(dist, nclusters=3)
cluster = dict()
for word, label in zip(words, labels):
    cluster.setdefault(label, []).append(word)
for label, grp in cluster.items():

yields a result like

['apple', 'Doppler', 'applaud', 'append']
['stake', 'steak', 'teak', 'sleek']
['barker', 'baker', 'bismark', 'park']

K-means data clustering, This tutorial explores the use of k-means algorithm to cluster data. The function loadd uses the GAUSS formula string format which allows for Visualizing the data can be one helpful step towards choosing the correct number of clusters. K-Means clustering algorithm is defined as a unsupervised learning methods having an iterative process in which the dataset are grouped into k number of predefined non-overlapping clusters or subgroups making the inner points of the cluster as similar as possible while trying to keep the clusters at distinct space it allocates the data points

K-means only works with euclidean distance. Edit distances such as Levenshtein don't even obey the triangle inequality may obey the triangle inequality, but are not euclidian. For the sorts of metrics you're interested in, you're better off using a different sort of algorithm, such as Hierarchical clustering:

Alternately, just convert your list of RNA into a weighted graph, with Levenshtein weights at the edges, and then decompose it into a minimum spanning tree. The most connected nodes of that tree will be, in a sense, the "most representative".

Text clustering with K-means and tf-idf | by Mikhail Salnikov, Same words in different strings can be badly affected to clustering life we can use scikit-learn implementation of TF-IDF and KMeans and I� The k-means algorithm can easily be used for this task and produces competitive results. A use case for this approach is image segmentation . Other uses of vector quantization include non-random sampling , as k -means can easily be used to choose k different but prototypical objects from a large data set for further analysis.

kmeans text clustering, Given text documents, we can group them automatically: text clustering. We'll use KMeans which is an unsupervised machine learning algorithm. I've collected In our example, documents are simply text strings that fit on the screen. In a real� Introduction to K-means Clustering. K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data (i.e., data without defined categories or groups). The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K.

Clustering a long list of strings (words) into similarity groups, Which you can get by multiplying the Levenshtein distance by -1. I threw together a quick example using the first paragraph of your question as input. In Python 3: K-Means is a clustering algorithm with one fundamental property: the number of clusters is defined in advance. In addition to K-Means, there are other types of clustering algorithms like Hierarchical Clustering, Affinity Propagation, or Spectral Clustering. 3.2. How K-Means Works

How to use KMeans clustering to find similar strings?, Why can't you just do hierarchical clustering using some sort of edit distance? EDIT: I guess you can't calculate the mean of the cluster directly, but you could use� from K-means clustering, credit to Andrey A. Shabalin. As, you can see, k-means algorithm is composed of 3 steps: Step 1: Initialization. The first thing k-means does, is randomly choose K examples (data points) from the dataset (the 4 green points) as initial centroids and that’s simply because it does not know yet where the center of each cluster is.

Interactive network, From the 'Clustering' menu it is possible to launch 2 different clustering You can move the slider in order to start the clustering with the right We suggest you to use the MCL algorithm (see review paper below). For example "algorithm" and "alogrithm" should have high chances to appear in the same cluster. I am well aware of the classical unsupervised clustering methods like k-means clustering, EM clustering in the Pattern Recognition literature. The problem here is that these methods work on points which reside in a vector space.

  • This answer doesn't make any sense. What is the "distance" between two strings of RNA such that it A) obeys the triangle inequality and B) is euclidean? There are many clustering algorithms, and it seems beyond me how k-means in particular would be useful in this circumstance.
  • The distance I am using is the structural distance, for example sequences: (1) "(((....)))" and (2) "((((..))))" Have a distance of 1 since the only difference in an insertion
  • Jerry, can you please explain how this can possibly work? As @sclv mentioned in his answer, K-means only works with Euclidean distance. It seems impossible to apply it to strings since at each step, you need to shift the centroids to an absolute position representing the mean of the closest data points... For arbitrary distance metrics, it seems that K-medoids would work instead since it uses data points as centroids instead.
  • @codesparkle: I didn't try to tell him that his measure of distance would work--I simply pointed out (at least part of) the requirement for K-means to work. There definitely are measures of differences between strings that fit those requirements (e.g., cosine similarity). Oh, and sclv's claim isn't entirely correct either: cosine similarity isn't strictly a Euclidean measure, but does just fine. Multiple definitions of centroid are accepted wrt k-means (e.g., 3), with differing requirements on the measure used.
  • This answer is just wrong. K-means also needs to compute means, and that requires floats, and requires squared Euclidean or Bergman divergences as "distance".
  • Levenshtein Distance and the Triangle Inequality
  • Thanks, fixed! Embarrassingly, the author of the blog is a friend of mine :-)