## Correlation coefficients for sparse matrix in python?

Does anyone know how to compute a correlation matrix from a very large sparse matrix in python? Basically, I am looking for something like numpy.corrcoef that will work on a scipy sparse matrix.

Just using numpy (here `A` is the sparse data matrix with observations in rows and variables in columns, and `N` is the number of rows):

```python
import numpy as np

# sample covariance matrix, built from the Gram matrix and the column sums
C = ((A.T * A - (sum(A).T * sum(A) / N)) / (N - 1)).todense()
# outer product of the standard deviations
V = np.sqrt(np.mat(np.diag(C)).T * np.mat(np.diag(C)))
# despite its name, COV holds the correlation coefficients;
# the tiny constant guards against division by zero
COV = np.divide(C, V + 1e-119)
```
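The snippet above leaves `A` and `N` undefined. A self-contained sketch of the same computation (the explicit column-sum variable `s` and the test data are my own additions, not part of the original answer):

```python
import numpy as np
from scipy import sparse

N = 500                                     # number of observations (rows)
A = sparse.rand(N, 10, density=0.3, format='csc')

s = sparse.csr_matrix(A.sum(0))             # column sums, kept sparse
C = np.asarray(((A.T * A - s.T * s / N) / (N - 1)).todense())
d = np.diag(C)
V = np.sqrt(np.outer(d, d))                 # outer product of std deviations
corr = C / (V + 1e-119)                     # correlation coefficients

# agrees with the dense computation on variables-as-columns data
print(np.allclose(corr, np.corrcoef(np.asarray(A.todense()), rowvar=False)))
# True
```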



You can compute the correlation coefficients fairly straightforwardly from the covariance matrix like this:

```python
import numpy as np
from scipy import sparse

def sparse_corrcoef(A, B=None):
    if B is not None:
        A = sparse.vstack((A, B), format='csr')

    A = A.astype(np.float64)
    n = A.shape[1]

    # Compute the covariance matrix
    rowsum = A.sum(1)
    centering = rowsum.dot(rowsum.T.conjugate()) / n
    C = (A.dot(A.T.conjugate()) - centering) / (n - 1)

    # The correlation coefficients are given by
    # C_{i,j} / sqrt(C_{i,i} * C_{j,j})
    d = np.diag(C)
    coeffs = C / np.sqrt(np.outer(d, d))

    return coeffs
```


Check that it works OK:

```python
# some smallish sparse random matrices
a = sparse.rand(100, 100000, density=0.1, format='csr')
b = sparse.rand(100, 100000, density=0.1, format='csr')

coeffs1 = sparse_corrcoef(a, b)
coeffs2 = np.corrcoef(a.todense(), b.todense())

print(np.allclose(coeffs1, coeffs2))
# True
```

##### Be warned:

The amount of memory required for computing the covariance matrix C will depend heavily on the sparsity structure of A (and B, if given). For example, if A is an (m, n) matrix (m variables in rows, n observations in columns) containing just a single fully dense column of non-zero values, then C = A·Aᵀ will be an (m, m) matrix with all entries non-zero. If m is large, this could be very bad news in terms of memory consumption.
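A quick way to see this blow-up (a toy example of my own construction): give every variable a non-zero value in the same single observation, and the Gram matrix A·Aᵀ used inside `sparse_corrcoef` comes out completely dense:

```python
import numpy as np
from scipy import sparse

# 500 variables (rows), 10000 observations (columns); each row has
# exactly one non-zero entry, all in the same column
rows = np.arange(500)
cols = np.zeros(500, dtype=int)
data = np.random.rand(500) + 0.1          # strictly positive values
A = sparse.csr_matrix((data, (rows, cols)), shape=(500, 10000))

gram = A @ A.T                            # the A.dot(A.T) term
print(gram.nnz)                           # 250000 == 500*500: fully dense
```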


Unfortunately, Alt's answer didn't work out for me. The values given to the np.sqrt function were mostly negative, so the resulting covariance values were nan.

I wasn't able to use ali_m's answer either, because my matrix was so large that I couldn't fit the `centering = rowsum.dot(rowsum.T.conjugate()) / n` matrix in memory (my matrix's dimensions are 3.5*10^6 x 33).

Instead, I used scikit-learn's StandardScaler to standardize the sparse matrix and then used a multiplication to obtain the correlation matrix.

```python
from sklearn.preprocessing import StandardScaler

def compute_sparse_correlation_matrix(A):
    # Assuming A is a CSR or CSC matrix with observations in rows
    scaler = StandardScaler(with_mean=False)
    scaled_A = scaler.fit_transform(A)
    corr_matrix = (1 / scaled_A.shape[0]) * (scaled_A.T @ scaled_A)
    return corr_matrix
```


I believe that this approach is faster and more robust than the other mentioned approaches. Moreover, it also preserves the sparsity pattern of the input matrix.
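A quick sanity check for this approach (my own test harness, not part of the original answer): because `with_mean=False` skips centering to preserve sparsity, the product matches true Pearson correlation only when the columns are already (near) zero-mean, which the check below enforces explicitly:

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

def compute_sparse_correlation_matrix(A):
    scaler = StandardScaler(with_mean=False)
    scaled_A = scaler.fit_transform(A)
    return (1 / scaled_A.shape[0]) * (scaled_A.T @ scaled_A)

# center first, then sparsify (only to verify correctness; centered
# data is dense, so real sparse inputs should already be near zero-mean)
X = np.random.randn(2000, 5)
X -= X.mean(axis=0)
Xs = sparse.csr_matrix(X)

corr = compute_sparse_correlation_matrix(Xs).toarray()
print(np.allclose(corr, np.corrcoef(X, rowvar=False)))
# True
```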





- Unless the data are already centred, `A = A - A.mean(1)` will destroy any sparsity. You may as well just convert to dense first!
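A short illustration of that point (a hypothetical example of my own): subtracting the row means from a scipy sparse matrix immediately produces a dense result, with nearly every entry non-zero.

```python
import numpy as np
from scipy import sparse

a = sparse.rand(1000, 1000, density=0.01, format='csr')
centered = a - a.mean(axis=1)      # broadcast subtract of row means

print(sparse.issparse(centered))   # False: the result is a dense matrix
print(np.count_nonzero(centered))  # ~one million entries instead of ~10000
```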