## Using a sparse matrix versus numpy array

I am creating some numpy arrays with word counts in Python: rows are documents, columns are counts for word X. If I have a lot of zero counts, people suggest using sparse matrices when processing these further, e.g. in a classifier. When feeding a numpy array versus a sparse matrix into the Scikit logistic regression classifier, it did not seem to make much of a difference, however. So I was wondering about three things:

Wikipedia says

a sparse matrix is a matrix in which most of the elements are zero

Is that an appropriate way to determine when to use a sparse matrix format - as soon as > 50 % of the values are zero? Or does it make sense to use just in case?

- How much does a sparse matrix help performance in a task like mine, especially compared to a numpy array or a standard list?
- So far, I collect my data into a numpy array, then convert into the csr_matrix in Scipy. Is that the right way to do it? I could not figure out how to build a sparse matrix from the ground up, and that might be impossible.

Any help is much appreciated!

The `scipy`

sparse matrix package, and similar ones in MATLAB, was based on ideas developed from linear algebra problems, such as solving large sparse linear equations (e.g. finite difference and finite element implementations). So things like matrix product (the `dot`

product for numpy arrays) and equation solvers are well developed.

My rough experience is that a sparse `csr`

matrix product has to have a 1% sparsity to be faster than the equivalent dense `dot`

operation - in other words, one nonzero value for every 99 zeros. (but see tests below)

But people also try to use sparse matrices to save memory. But keep in mind that such a matrix has to store 3 arrays of values (at least in the `coo`

format). So the sparsity has to be less than 1/3 to start saving memory. Obviously you aren't going to save memory if you first build the dense array, and create the sparse one from that.

The `scipy`

package implements many sparse formats. The `coo`

format is easiest to understand and build. Build one according to documentation and look at its `.data`

, `.row`

, and `.col`

attributes (3 1d arrays).

`csr`

and `csc`

are typically built from the `coo`

format, and compress the data a bit, making them a bit harder to understand. But they have most of the math functionality.

It is also possible to index `csr`

format, though in general this is slower than the equivalent dense matrix/array case. Other operations like changing values (especially from 0 to nonzero), concatenation, incremental growth, are also slower.

`lil`

(lists of lists) is also easy to understand, and best for incremental building. `dok`

is a actually a dictionary subclass.

A key point is that a sparse matrix is limited to 2d, and in many ways behaves like the `np.matrix`

class (though it isn't a subclass).

A search for other questions using `scikit-learn`

and `sparse`

might be the best way of finding the pros/cons of using these matrices. I've answered a number of questions, but I know the 'sparse' side better than the 'learn' side. I think they are useful, but I get the sense is that the fit isn't always the best. Any customization is on the `learn`

side. So far the `sparse`

package has not been optimized for this application.

I just tried some matrix product tests, using the `sparse.random`

method to create a sparse matrix with a specified sparsity. Sparse matrix multiplication performed better than I expected.

In [251]: M=sparse.random(1000,1000,.5) In [252]: timeit M1=M*M 1 loops, best of 3: 2.78 s per loop In [253]: timeit Ma=M.toarray(); M2=Ma.dot(Ma) 1 loops, best of 3: 4.28 s per loop

It is a size issue; for smaller matrix the dense `dot`

is faster

In [255]: M=sparse.random(100,100,.5) In [256]: timeit M1=M*M 100 loops, best of 3: 3.24 ms per loop In [257]: timeit Ma=M.toarray(); M2=Ma.dot(Ma) 1000 loops, best of 3: 1.44 ms per loop

But compare indexing

In [268]: timeit M.tocsr()[500,500] 10 loops, best of 3: 86.4 ms per loop In [269]: timeit Ma[500,500] 1000000 loops, best of 3: 318 ns per loop In [270]: timeit Ma=M.toarray();Ma[500,500] 10 loops, best of 3: 23.6 ms per loop

**A Gentle Introduction to Sparse Matrices for Machine Learning,** Sparse matrices can cause problems with regards to space and time stored in a NumPy array can be converted into a sparse matrix using the� To efficiently represent a sparse matrix, CSR uses three numpy arrays to store some relevant information, including: data: the values of the non-zero values — these are the non-zero values that are stored within the sparse matrix indices: an array of column indices — starting from the first row

a sparse matrix is a matrix in which most of the elements are zero Is that an appropriate way to determine when to use a sparse matrix format - as soon as > 50 % of the values are zero? Or does it make sense to use just in case?

There is no general rule. It solely depends on your exact usage later on. You have to compute complexity of the model based on sparse matrix and without, and then you can find the "sweet spot". This will depend on both number of samples and dimension. In general, it often boils down to matrix multiplications of the form

X' W

where X is data matrix N x d, and W is some weight matrix d x K. Consequently "dense" multiplication takes `NdK`

time, while sparse, assuming that your average per-row sparsity is p is `NpdK`

. Thus if your sparsity is 50% you can expect nearly 2x faster operation. The harder part is to estimate the overhead of sparse access as opposed to heavily optimized dense based.

How much does a sparse matrix help performance in a task like mine, especially compared to a numpy array or a standard list?

For a particular case of LR, this can be even few times faster than dense format, but in order to observe the difference you need lots of data (>1000) of high dimension (>100).

So far, I collect my data into a numpy array, then convert into the csr_matrix in Scipy. Is that the right way to do it? I could not figure out how to build a sparse matrix from the ground up, and that might be impossible.

No, it is not a good approach. You can build it "from scratch" by for example first building a dictionary and then converting it etc. there are plenty of ways to construct sparse matrix without a dense one in the first place.

**Sparse matrices (scipy.sparse),** SciPy 2-D sparse matrix package for numeric data. Generate a sparse matrix of the given shape and density with uniformly distributed values. for the given sparse matrix class, or convert the sparse matrix to a NumPy array (e.g., using the � It makes sense that operations between like types return the same type as the input type. However, it would be nice if operations between numpy.arrays and scipy.sparse matrices would return always the same array type.

@hpaulj Your timeit is wrong, u are getting slow results cause of mapping sparse.random to numpy array (its slowish) with that in mind:

M=sparse.random(1000,1000,.5) Ma=M.toarray() %timeit -n 25 M1=M*M 352 ms ± 1.18 ms per loop (mean ± std. dev. of 7 runs, 25 loops each) %timeit -n 25 M2=Ma.dot(Ma) 13.5 ms ± 2.17 ms per loop (mean ± std. dev. of 7 runs, 25 loops each)

To get close to numpy we need to have

M=sparse.random(1000,1000,.03) %timeit -n 25 M1=M*M 10.7 ms ± 119 µs per loop (mean ± std. dev. of 7 runs, 25 loops each) %timeit -n 25 M2=Ma.dot(Ma) 11.4 ms ± 564 µs per loop (mean ± std. dev. of 7 runs, 25 loops each)

**Introduction to Sparse Matrices in Python with SciPy,** Learning to work with Sparse matrix, a large matrix or 2d-array with a lot elements being zero, can be extremely handy. Python's SciPy library� Using a sparse matrix versus numpy array. 2. Cannot replicate results comparing Python, Numpy and Numba matrix multiplication. 1. Remove nan rows in a scipy sparse

**Operations between numpy.array and scipy.sparse matrix return ,** It appears that adding or subtracting numpy.ndarrays with scipy.sparse matrices returns a numpy.matrix. Is the inconsistency in returned array� To do a vector product between a sparse matrix and a vector simply use the matrix dot method, as described in its docstring: >>> import numpy as np >>> from scipy.sparse import csr_matrix >>> A = csr_matrix ([[ 1, 2, 0 ], [ 0, 0, 3 ], [ 4, 0, 5 ]]) >>> v = np. array ([ 1, 0, - 1 ]) >>> A. dot (v) array([ 1, -3, -1], dtype=int64)

**Why We Use Sparse Matrices for Recommender Systems,** When we run matrix computations and we want to store those sparse matrices as a Numpy array or Pandas DataFrame, they consume memory� I really couldn't google it. How to transform sparse matrix to ndarray? Assume, I have sparse matrix t of zeros. Then g = t.todense() g[:10] matrix([[0], [0], [0], [0

**Sparse data structures in Python,** Let's create a random sparse matrix and compare its size to an lil_matrix in scipy), which uses two numpy arrays with regular Python lists� The provided array must have the same shape and dtype as the sparse matrix on which you are calling the method. For most sparse types, out is required to be memory contiguous (either C or Fortran ordered). Returns arr ndarray, 2-D. An array with the same shape and containing the same data represented by the sparse matrix, with the requested

##### Comments

- As an added note, the scipy documentation patrick links to actually has a few examples at the bottom of how to build a sparse matrix from scratch.
- i know you're not supposed to post "thank you" comments, but that is an awesome answer & a very helpful comment. thanks guys.