Hot questions for using neural networks in cluster analysis

Question:

I found a similar question asked here: Determining cluster membership in SOM (Self Organizing Map) for time series data,

and I want to learn how to apply a self-organizing map to binarize data or, more generally, to assign more than 2 kinds of symbols to it.

For example, let data = rand(100,1). In general, I would do data_quantized = 2*(data>=0.5)-1 to get a binary-valued transformed series, where the threshold 0.5 is assumed and fixed. It should also be possible to quantize the data using more than 2 symbols. Can kmeans or SOM be applied to this task? What should be the input and output if I were to use a SOM to quantize the data?

Let X = {x_i(t)} for i = 1:N and t = 1:T, where T is the length of each time series and N is the number of components/variables. To get the quantized value for any vector x_i, one uses the value of the BMU that is nearest to it. The quantization error is then the Euclidean norm of the difference between the input vector and the best-matching model. A new time series can afterwards be compared/matched using the symbol representation of the time series. Would the BMU be a scalar-valued number or a vector of floating-point numbers? It is very hard to picture what the SOM is doing.
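To make the BMU idea concrete, here is a minimal sketch of what I mean for N = 1 (the codebook values below are made up, standing in for trained unit weights):

% Hypothetical 1-D codebook: each unit's weight vector has the same
% dimension as the input, so for N = 1 the BMU weight is a scalar.
codebook = [0.2; 0.5; 0.8];            % assumed trained SOM unit weights
x = 0.63;                              % one sample to quantize
[qerr, bmu] = min(abs(codebook - x));  % quantization error and BMU index
symbol = bmu;                          % symbolic representation (here: 2)
xq = codebook(bmu);                    % quantized value = BMU weight (0.5)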

Matlab implementation https://www.mathworks.com/matlabcentral/fileexchange/39930-self-organizing-map-simple-demonstration

I cannot understand how this works for quantizing time series. Assuming N = 1, i.e., a 1-dimensional array/vector of elements obtained from a white-noise process, how can I quantize/partition this data using a self-organizing map?

http://www.mathworks.com/help/nnet/ug/cluster-with-self-organizing-map-neural-network.html

is provided by Matlab, but it works on N-dimensional data, whereas I have 1-dimensional data containing 1000 data points (t = 1,...,1000).

It would be of immense help if a toy example were provided that explains how a time series can be quantized into multiple levels. Let trainingData = x_i;

T = 1000;
N = 1;
x_i = rand(T,N);

How can I apply the SOM code below so that the numerical data are represented by symbols such as 1, 2, 3, i.e., clustered using 3 symbols? Each (scalar-valued) data point would then be represented by symbol 1, 2, or 3.

function som = SOMSimple(nfeatures, ndim, nepochs, ntrainingvectors, eta0, etadecay, sgm0, sgmdecay, showMode)
%SOMSimple Simple demonstration of a Self-Organizing Map that was proposed by Kohonen.
%   sommap = SOMSimple(nfeatures, ndim, nepochs, ntrainingvectors, eta0, neta, sgm0, nsgm, showMode) 
%   trains a self-organizing map with the following parameters
%       nfeatures        - dimension size of the training feature vectors
%       ndim             - width of a square SOM map
%       nepochs          - number of epochs used for training
%       ntrainingvectors - number of training vectors that are randomly generated
%       eta0             - initial learning rate
%       etadecay         - exponential decay rate of the learning rate
%       sgm0             - initial variance of a Gaussian function that
%                          is used to determine the neighbours of the best 
%                          matching unit (BMU)
%       sgmdecay         - exponential decay rate of the Gaussian variance 
%       showMode         - 0: do not show output, 
%                          1: show the initially randomly generated SOM map 
%                             and the trained SOM map,
%                          2: show the trained SOM map after each update
%
%   For example: A demonstration of an SOM map that is trained on 1-D values
%           
%       som = SOMSimple(1,60,10,100,0.1,0.05,20,0.05,2);
%       % It uses:
%       %   1    : dimensions for training vectors
%       %   60x60: neurons
%       %   10   : epochs
%       %   100  : training vectors
%       %   0.1  : initial learning rate
%       %   0.05 : exponential decay rate of the learning rate
%       %   20   : initial Gaussian variance
%       %   0.05 : exponential decay rate of the Gaussian variance
%       %   2    : Display the som map after every update

nrows = ndim;
ncols = ndim;
som = rand(nrows,ncols,nfeatures);

% Generate random training data (ntrainingvectors-by-nfeatures)
trainingData = rand(ntrainingvectors, nfeatures);

% Generate coordinate system
[x y] = meshgrid(1:ncols,1:nrows);

for t = 1:nepochs    
    % Compute the learning rate for the current epoch
    eta = eta0 * exp(-t*etadecay);        

    % Compute the variance of the Gaussian (Neighbourhood) function for the current epoch
    sgm = sgm0 * exp(-t*sgmdecay);

    % Consider the width of the Gaussian function as 3 sigma
    width = ceil(sgm*3);        

    for ntraining = 1:ntrainingvectors
        % Get current training vector
        trainingVector = trainingData(ntraining,:);

        % Compute the Euclidean distance between the training vector and
        % each neuron in the SOM map
        dist = getEuclideanDistance(trainingVector, som, nrows, ncols, nfeatures);

        % Find the best matching unit (bmu)
        [~, bmuindex] = min(dist);

        % transform the bmu index into 2D
        [bmurow bmucol] = ind2sub([nrows ncols],bmuindex);        

        % Generate a Gaussian function centered on the location of the bmu
        g = exp(-(((x - bmucol).^2) + ((y - bmurow).^2)) / (2*sgm*sgm));

        % Determine the boundary of the local neighbourhood
        fromrow = max(1,bmurow - width);
        torow   = min(bmurow + width,nrows);
        fromcol = max(1,bmucol - width);
        tocol   = min(bmucol + width,ncols);

        % Get the neighbouring neurons and determine the size of the neighbourhood
        neighbourNeurons = som(fromrow:torow,fromcol:tocol,:);
        sz = size(neighbourNeurons);

        % Transform the training vector and the Gaussian function into 
        % multi-dimensional to facilitate the computation of the neuron weights update
        T = reshape(repmat(trainingVector,sz(1)*sz(2),1),sz(1),sz(2),nfeatures);                   
        G = repmat(g(fromrow:torow,fromcol:tocol),[1 1 nfeatures]);

        % Update the weights of the neurons that are in the neighbourhood of the bmu
        neighbourNeurons = neighbourNeurons + eta .* G .* (T - neighbourNeurons);

        % Put the new weights of the BMU neighbouring neurons back to the
        % entire SOM map
        som(fromrow:torow,fromcol:tocol,:) = neighbourNeurons;


    end
end


function ed = getEuclideanDistance(trainingVector, sommap, nrows, ncols, nfeatures)

% Transform the 3D representation of neurons into 2D
neuronList = reshape(sommap,nrows*ncols,nfeatures);               

% Initialize Euclidean Distance
ed = 0;
for n = 1:size(neuronList,2)
    ed = ed + (trainingVector(n)-neuronList(:,n)).^2;
end
ed = sqrt(ed);
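For instance, after training I would guess that the symbols can be read out by treating the trained map weights as the codebook, something like the following (my own guess, not part of the posted implementation; note that SOMSimple trains on its own randomly generated data, and that a square map cannot give exactly 3 symbols, so a 2x2 map with 4 symbols is used):

% Guessed usage (hypothetical, not from the posted code)
som = SOMSimple(1, 2, 10, 1000, 0.1, 0.05, 20, 0.05, 0);  % 2x2 map: 4 symbols
codebook = som(:);                 % 4-by-1 list of trained unit weights
x_i = rand(1000, 1);               % the 1-D series to quantize
[~, symbols] = min(abs(bsxfun(@minus, codebook', x_i)), [], 2);
% symbols(t) in {1,...,4} is the BMU index of x_i(t);
% codebook(symbols) gives the corresponding quantized values.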

Answer:

I may be misunderstanding your question, but from what I understand it is really quite straightforward, both with kmeans and with Matlab's own selforgmap. I cannot really comment on the SOMSimple implementation you have posted.

Let's take your initial example:

rng(1337);
T = 1000;
x_i = rand(1,T); % row vector for convenience

Assuming you want to quantize to three symbols, your manual version could be:

nsyms = 3;
symsthresh = 1:-1/nsyms:1/nsyms;   % thresholds [1, 2/3, 1/3]
x_i_q = zeros(size(x_i));

% Lower thresholds overwrite earlier assignments, so values in (2/3,1]
% end up as symbol 1, (1/3,2/3] as 2, and [0,1/3] as 3.
for i=1:nsyms
    x_i_q(x_i<=symsthresh(i)) = i;
end

Using Matlab's own selforgmap you can achieve a similar result:

net = selforgmap(nsyms);
net.trainParam.showWindow = false;
net = train(net,x_i);
y = net(x_i);
classes = vec2ind(y);
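If you also want the quantized values rather than just the class indices, the codebook can be read from the trained network's weights; a small sketch, assuming the net trained above:

codebook = net.IW{1,1};                        % nsyms-by-1 neuron weights
x_i_q = reshape(codebook(classes), size(x_i)); % BMU weight per sample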

Lastly, the same can be done straightforwardly with kmeans:

clusters = kmeans(x_i',nsyms)';
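Here as well, the cluster centroids can serve as the quantized values; a sketch using the second output of kmeans:

[clusters, C] = kmeans(x_i', nsyms);     % C is nsyms-by-1 for 1-D data
x_i_q = reshape(C(clusters), size(x_i)); % replace each sample by its centroid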

Question:

I have a multi-dimensional array of size 37759x4096: 37759 is the number of observations, and each feature vector is of size 4096.

These are VGG features that I extracted from 37759 images. I wanted to perform k-means clustering to see whether they would group into the same classes.

code snippet:

from sklearn.cluster import KMeans
import numpy as np

features = np.asarray(features) # convert the list of features to an array
kmeans = KMeans(n_clusters=17).fit(features)

output:

 In [26]: kmeans.labels_
Out[26]: array([ 0,  0,  0, ..., 11, 11, 11], dtype=int32)

In [27]: len(kmeans.labels_)
Out[27]: 37759

In [28]: kmeans.cluster_centers_
Out[28]: 
array([[  2.46095985e-01,  -4.32133675e-07,   6.41381502e-01, ...,
          9.16770659e-09,   2.39292532e-03,   9.38249767e-01],
       [  1.18244767e+00,   8.83443374e-03,   8.44059408e-01, ...,
          6.17001206e-09,   7.23063201e-03,   4.57734227e-01],
       [  5.05003333e-01,   2.45869160e-07,   1.07537758e+00, ...,
         -4.24915925e-09,   2.19564766e-01,   6.04652226e-01],
       ..., 
       [  2.72164375e-01,   7.94929452e-03,   8.18695068e-01, ...,
         -3.43425199e-09,   7.62813538e-03,   2.84249210e+00],
       [  1.03947210e+00,   1.03959814e-04,   7.81472027e-01, ...,
          7.42147677e-09,   1.28777415e-01,   8.22515607e-01],
       [  1.55310243e-01,   6.24559261e-02,   7.55328536e-01, ...,
         -3.84170562e-09,   2.09998786e-02,   4.18608427e-01]], dtype=float32)

First of all, since this is high-dimensional data, I am not sure k-means is the best way to go about it. It produced only 11 clusters instead of 17. But anyway:

  1. How can we ensure that it is clustering the arrays row-wise (by sample observations) and not column-wise (by features)?
  2. Features of the same classes are stacked together, but we can see in kmeans.cluster_centers_ that the cluster centers are very different, judging from the first three arrays.
  3. How can I visualize this data? How do I find the unique arrays?
  4. Do you have any leads on how I can perform clustering on very high-dimensional data such as this?

Answer:

Clusters in kmeans can become empty and thus disappear.

If this happens, the initial centers were badly chosen, and the result is often not "stable": if you try different initial seeds (e.g., by varying random_state in sklearn's KMeans), you will likely get very different results.

Clustering and visualizing such data is difficult, and you won't find an easy out-of-the-box solution.

Question:

Suppose that we train a self-organising map (SOM) on a given dataset. Would it make sense to cluster the neurons of the SOM instead of the original data points? This question came to me after reading this paper, in which the following is stated:

The most important benefit of this procedure is that computational load decreases considerably, making it possible to cluster large data sets and to consider several different preprocessing strategies in a limited time. Naturally, the approach is valid only if the clusters found using the SOM are similar to those of the original data.

In this answer it is clearly stated that SOMs do not themselves include clustering, but that some clustering procedure can be applied to the SOM after it has been trained. I took this to mean that the clustering is done on the neurons of the SOM, which are in some sense a mapping of the original data, but I am not sure about this. So, what I want to know is:

  • Is it correct to cluster the data by running the clustering algorithm on the trained neuron weights as data points? If not, how is clustering done using a SOM?
  • What characteristics should a dataset have, in general, for this approach to be useful?

Answer:

Yes, the usual approach seems to be either hierarchical clustering or k-means on the neurons (you will need to dig up how it was originally done; as seen in the paper you linked, many variants, including two-level approaches, have been explored since). If you consider SOMs to be a quantization and projection technique, all of these approaches are valid to use.

It is cheaper because the neurons are just 2-dimensional, Euclidean, and far fewer in number than the original points. So that is well in line with the source that you have.

Note that a SOM neuron may be empty if it lies in between two extremely well-separated clusters.
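To make the two-level idea concrete, here is a minimal Matlab sketch of an assumed workflow (using selforgmap and kmeans; this is an illustration, not the procedure from the linked paper): train a SOM with many neurons, cluster the neuron weight vectors, then label each sample with the cluster of its BMU.

% Two-level clustering sketch: cluster the SOM codebook, not the raw data.
data = rand(10, 5000);               % toy data: 10 features, 5000 samples
net = selforgmap([10 10]);           % 100 neurons, far fewer than samples
net.trainParam.showWindow = false;
net = train(net, data);

codebook = net.IW{1,1};              % 100-by-10 matrix of neuron weights
k = 5;                               % assumed number of final clusters
neuronClusters = kmeans(codebook, k);         % second-level clustering
bmu = vec2ind(net(data));                     % BMU index for each sample
dataClusters = reshape(neuronClusters(bmu), 1, []);  % sample -> BMU's cluster

Empty neurons simply never occur in bmu, so they do not affect the final sample labels.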