Convert distance pairs to distance matrix to use in hierarchical clustering


I am trying to convert a dictionary to a distance matrix that I can then use as an input to hierarchical clustering: I have as an input:

  • key: tuple of length 2 with the objects for which I have the distance
  • value: the actual distance value

    for k, v in obj_distances.items():
        print(k, v)
    

and the result is:

('obj1', 'obj2') 2.0
('obj3', 'obj4') 1.58
('obj1', 'obj3') 1.95
('obj2', 'obj3') 1.80

My question is: how can I convert this into a distance matrix that I can later use for clustering in scipy?

You say you will use scipy for clustering, so I assume that means you will use the function scipy.cluster.hierarchy.linkage. linkage accepts the distance data in "condensed" form, so you don't have to create the full symmetric distance matrix. (See, e.g., the question "How does condensed distance matrix work? (pdist)" for a discussion of the condensed form.)
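As a quick illustration of the condensed layout (a sketch using the six distances from the dictionary in the snippet below, with the objects taken in sorted order): the condensed vector simply lists the upper-triangle entries row by row, and scipy.spatial.distance.squareform converts between the condensed and square forms.

import numpy as np
from scipy.spatial.distance import squareform

# Distances in the order (obj1,obj2), (obj1,obj3), (obj1,obj4),
# (obj2,obj3), (obj2,obj4), (obj3,obj4).
condensed = np.array([2.0, 1.95, 2.5, 1.8, 2.1, 1.58])

# Expand to the full 4x4 symmetric matrix (squareform also does the reverse).
print(squareform(condensed))
# [[0.   2.   1.95 2.5 ]
#  [2.   0.   1.8  2.1 ]
#  [1.95 1.8  0.   1.58]
#  [2.5  2.1  1.58 0.  ]]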

So all you have to do is get obj_distances.values() into a known order and pass that to linkage. That's what is done in the following snippet:

from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

obj_distances = {
    ('obj2', 'obj3'): 1.8,
    ('obj3', 'obj1'): 1.95,
    ('obj1', 'obj4'): 2.5,
    ('obj1', 'obj2'): 2.0,
    ('obj4', 'obj2'): 2.1,
    ('obj3', 'obj4'): 1.58,
}

# Put each key pair in a canonical order, so we know that if (a, b) is a key,
# then a < b.  If this is already true, then the next three lines can be
# replaced with
#     sorted_keys, distances = zip(*sorted(obj_distances.items()))
# Note: we assume there are no keys where the two objects are the same.
keys = [sorted(k) for k in obj_distances.keys()]
values = obj_distances.values()
sorted_keys, distances = zip(*sorted(zip(keys, values)))

# linkage accepts the "condensed" format of the distances.
Z = linkage(distances)

# Optional: create a sorted list of the objects.
labels = sorted(set([key[0] for key in sorted_keys] + [sorted_keys[-1][-1]]))

dendrogram(Z, labels=labels)
plt.show()

The dendrogram:

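If you also want flat cluster labels after inspecting the dendrogram, scipy.cluster.hierarchy.fcluster can cut the linkage matrix Z computed above. A minimal sketch (the choice of two clusters here is only illustrative, not part of the answer):

from scipy.cluster.hierarchy import fcluster

# Cut the tree into at most two flat clusters (illustrative choice).
cluster_ids = fcluster(Z, t=2, criterion='maxclust')

# Pair each object label with its cluster id.
print(dict(zip(labels, cluster_ids)))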

Use pandas and unstack the dataframe:

import pandas as pd

data = {('obj1', 'obj2'): 2.0,
        ('obj3', 'obj4'): 1.58,
        ('obj1', 'obj3'): 1.95,
        ('obj2', 'obj3'): 1.80}

df = pd.DataFrame.from_dict(data, orient='index')
df.index = pd.MultiIndex.from_tuples(df.index.tolist())
dist_matrix = df.unstack().values

yields

In [15]: dist_matrix
Out[15]:

array([[2.  , 1.95,  nan],
       [ nan, 1.8 ,  nan],
       [ nan,  nan, 1.58]])
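Note that this unstacked array is not yet a full symmetric distance matrix: it is missing the lower triangle, it contains NaNs, and with the example data it is not even square. If the goal is still scipy clustering, one way to complete it (a sketch, not part of this answer; any pair absent from data stays NaN and has to be filled in before clustering) is:

import numpy as np
import pandas as pd

# Every object that appears in any pair.
labels = sorted({obj for pair in data for obj in pair})

# Drop the extra column level from the unstacked frame, align it to the
# full square index, then mirror it across the diagonal.
square = df.unstack().droplevel(0, axis=1).reindex(index=labels, columns=labels)
square = square.combine_first(square.T)

# An object is at distance zero from itself.
mat = square.to_numpy(dtype=float)
np.fill_diagonal(mat, 0.0)

print(pd.DataFrame(mat, index=labels, columns=labels))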


This will be slower than the other answer posted, but it will ensure that values both above and below the main diagonal are included, if that's important to you:

import pandas as pd

unique_ids = sorted(set([x for y in obj_distances.keys() for x in y]))
df = pd.DataFrame(index=unique_ids, columns=unique_ids)

for k, v in obj_distances.items():
    df.loc[k[0], k[1]] = v
    df.loc[k[1], k[0]] = v

Results:

      obj1 obj2  obj3  obj4
obj1   NaN    2  1.95   NaN
obj2     2  NaN   1.8   NaN
obj3  1.95  1.8   NaN  1.58
obj4   NaN  NaN  1.58   NaN
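Once every pairwise distance is present (the comments under the question confirm the real data has all pairs, even though the example shown here does not), this symmetric frame can be handed to scipy by zeroing the diagonal and condensing it. A rough sketch, assuming the only remaining NaNs are on the diagonal:

from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage

# With a complete set of pairs, the only NaNs left are the self-distances,
# which should be zero anyway.
mat = df.fillna(0.0).to_numpy(dtype=float)

# Condense the symmetric matrix and hand it to linkage.
Z = linkage(squareform(mat))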


Comments
  • You can first create a matrix of zeros, then use int(a[-1]) as the index into that matrix (where a is 'obj1', 'obj2', etc.) and store the distance values there
  • Is your set of distances complete? That is, does the distance dictionary contain a distance for each possible pair? For example, the data that you show doesn't include distances for ('obj1', 'obj4') or ('obj2', 'obj4'). You'll need these values to do clustering.
  • Hi @WarrenWeckesser, yes it is, I just omitted it to save space, but yes I have all pairwise distances, thanks
  • Thanks @WarrenWeckesser exactly what I was looking to do, appreciate it!
  • Thanks a lot @thesilkworm