## Correlation coefficients for sparse matrix in python?

Does anyone know how to compute a correlation matrix from a very large sparse matrix in python? Basically, I am looking for something like `numpy.corrcoef` that will work on a scipy sparse matrix.

Just using numpy:

```python
import numpy as np

# Assumes A is an (N, n) scipy sparse matrix with N rows (samples);
# note that `sum(A)` sums the rows, giving a 1 x n matrix of column sums.
C = ((A.T * A - (sum(A).T * sum(A) / N)) / (N - 1)).todense()
V = np.sqrt(np.mat(np.diag(C)).T * np.mat(np.diag(C)))
COV = np.divide(C, V + 1e-119)  # despite the name, this is the correlation matrix
```
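A sanity check of the same computation, reformulated to avoid the deprecated `np.matrix` type (a sketch; it assumes `A` is an `(N, n)` sparse matrix whose columns are the variables, which is what the snippet above implies):

```python
import numpy as np
from scipy import sparse

# A small reproducible sparse matrix: 200 samples, 10 variables
A = sparse.rand(200, 10, density=0.5, format='csr', random_state=0)
N = A.shape[0]

# Sample covariance of the columns: (A'A - outer(colsums, colsums)/N) / (N - 1)
s = np.asarray(A.sum(axis=0)).ravel()
C = np.asarray((A.T @ A).todense()) - np.outer(s, s) / N
C /= N - 1

# Correlation = covariance normalised by the standard deviations
d = np.sqrt(np.diag(C))
corr = C / np.outer(d, d)

print(np.allclose(corr, np.corrcoef(A.toarray(), rowvar=False)))  # True
```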


You can compute the correlation coefficients fairly straightforwardly from the covariance matrix like this:

```python
import numpy as np
from scipy import sparse

def sparse_corrcoef(A, B=None):
    if B is not None:
        A = sparse.vstack((A, B), format='csr')

    A = A.astype(np.float64)
    n = A.shape[1]

    # Compute the covariance matrix
    rowsum = A.sum(1)
    centering = rowsum.dot(rowsum.T.conjugate()) / n
    C = (A.dot(A.T.conjugate()) - centering) / (n - 1)

    # The correlation coefficients are given by
    # C_{i,j} / sqrt(C_{i,i} * C_{j,j})
    d = np.diag(C)
    coeffs = C / np.sqrt(np.outer(d, d))

    return coeffs
```

Check that it works OK:

```python
# some smallish sparse random matrices
a = sparse.rand(100, 100000, density=0.1, format='csr')
b = sparse.rand(100, 100000, density=0.1, format='csr')

coeffs1 = sparse_corrcoef(a, b)
coeffs2 = np.corrcoef(a.todense(), b.todense())

print(np.allclose(coeffs1, coeffs2))
# True
```

##### Be warned:

The amount of memory required for computing the covariance matrix `C` will depend heavily on the sparsity structure of `A` (and of `B`, if given). For example, if `A` is an `(m, n)` matrix containing just a *single* column of non-zero values, then every pair of rows overlaps in that column and `C` will be an `(m, m)` matrix containing *all* non-zero values. If `m` is large, this could be very bad news in terms of memory consumption.
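To see the extreme case from the warning above in action: a matrix with one fully dense column is itself very sparse, yet the Gram matrix `A @ A.T` used inside the covariance computation has no zero entries at all (a small illustrative sketch):

```python
import numpy as np
from scipy import sparse

m, n = 500, 10000
A = sparse.lil_matrix((m, n))
A[:, 0] = 1.0           # a single dense column; everything else is zero
A = A.tocsr()

gram = A @ A.T          # the (rows x rows) term inside the covariance computation
print(A.nnz)            # 500 stored entries: A itself is very sparse
print(gram.nnz)         # 250000 = m * m: the Gram matrix is completely dense
```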


Unfortunately, Alt's answer didn't work out for me. The values given to the `np.sqrt` function were mostly negative, so the resulting correlation values were NaN.

I wasn't able to use ali_m's answer either, because my matrix was too large: I couldn't fit the `centering = rowsum.dot(rowsum.T.conjugate()) / n` matrix in memory (my matrix's dimensions are 3.5*10^6 x 33).

Instead, I used scikit-learn's `StandardScaler` to compute a standardized sparse matrix and then obtained the correlation matrix with a single multiplication.

```python
from sklearn.preprocessing import StandardScaler

def compute_sparse_correlation_matrix(A):
    # Assuming A is a CSR or CSC matrix
    scaler = StandardScaler(with_mean=False)
    scaled_A = scaler.fit_transform(A)
    # Without mean-centering (with_mean=False keeps the matrix sparse),
    # this equals the Pearson correlation only when the column means are ~0.
    corr_matrix = (1 / scaled_A.shape[0]) * (scaled_A.T @ scaled_A)
    return corr_matrix
```

I believe that this approach is faster and more robust than the other mentioned approaches. Moreover, it also preserves the sparsity pattern of the input matrix.
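For data whose columns are already centred, the shortcut does agree with `numpy.corrcoef`; a quick check (the function is re-stated so the snippet is self-contained):

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

def compute_sparse_correlation_matrix(A):
    scaler = StandardScaler(with_mean=False)
    scaled_A = scaler.fit_transform(A)
    return (1 / scaled_A.shape[0]) * (scaled_A.T @ scaled_A)

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 6))
X -= X.mean(axis=0)                      # centre the columns first
corr = compute_sparse_correlation_matrix(sparse.csr_matrix(X))

print(np.allclose(corr.toarray(), np.corrcoef(X, rowvar=False)))  # True
```

Note that centring a genuinely sparse matrix would destroy its sparsity, so this check only validates the shortcut on data whose column means are already (close to) zero.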


##### Comments

- This is a good response. It produces a dense covariance matrix, but never changes the sparsity pattern of the input matrix.
- This is an excellent response for the reason @joeln mentioned. I needed the covariance matrix for a massive dataset so that I could look for multicollinearity among the features using eigendecomposition. Well done.
- Very helpful, thanks... I've modified the numpy corrcoef method for memory savings. I suggest the following: use in-place operations such as `A -= A` or `A /= A`. Convert to np.float32; BLAS has functions for both 64- and 32-bit floats. Drop the conjugate unless you need it. One reason the memory usage is so bad is likely because the dot product routine expects square-ish matrices and actually pads out with zeros for optimization. My former boss was very good with computational C.
- @wbg The issue I was pointing out is that, depending on its sparsity structure, even a very sparse array can still have a very dense covariance matrix. This is not really a problem with the implementation details, but is rather a fundamental issue when computing the covariance matrices of sparse arrays. There are a few work-arounds, for example by computing a truncated version of the covariance matrix by imposing a penalty on its L1 norm (e.g. here).
- Thanks for the tip about using a penalty. I think this topic is rich for study. Cheers.
- Unless the data are already centred, `A = A - A.mean(1)` will destroy any sparsity. You may as well just convert to dense first!
- @joeln Good point - I've updated my answer to avoid doing this.
- Unless you are using complex numbers (which is not the case here), in `V = np.sqrt(np.mat(np.diag(C)).T * np.mat(np.diag(C)))`, the product `np.mat(np.diag(C)).T * np.mat(np.diag(C))` will only have non-negative entries: it is the outer product of the diagonal of `C` (the variances) with itself, so each entry of `np.diag(C)` effectively gets squared. I would debug your code; there is a chance that something else is going wrong.
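The L1-penalty work-around mentioned in the comments (a penalised estimate that keeps the precision matrix sparse) can be sketched with scikit-learn's `GraphicalLasso`; the synthetic dataset and `alpha` value below are illustrative assumptions, not part of the original discussion:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso
from sklearn.datasets import make_sparse_spd_matrix

# Sample from a model whose precision (inverse covariance) matrix is sparse
prec = make_sparse_spd_matrix(10, alpha=0.9, random_state=0)
cov = np.linalg.inv(prec)
rng = np.random.default_rng(0)
X = rng.multivariate_normal(np.zeros(10), cov, size=1000)

# The L1 penalty (alpha) drives small precision entries to exactly zero
model = GraphicalLasso(alpha=0.05).fit(X)
print(model.covariance_.shape)    # (10, 10)
print(model.precision_.shape)     # (10, 10)
```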