Repeat a scipy csr sparse matrix along axis 0

scipy sparse matrix multiplication
sort sparse matrix python
scipy sparse multiply matrix
compressed sparse row format to numpy array
scipy csr to csc
scipy matrices
numpy sparse tensor
matlab sparse in python

I wanted to repeat the rows of a scipy csr sparse matrix, but when I tried to call numpy's repeat method, it simply treats the sparse matrix like an object, and would only repeat it as an object in an ndarray. I looked through the documentation, but I couldn't find any utility to repeats the rows of a scipy csr sparse matrix.

I wrote the following code that operates on the internal data, which seems to work

def csr_repeat(csr, repeats):
    if isinstance(repeats, int):
        repeats = np.repeat(repeats, csr.shape[0])
    repeats = np.asarray(repeats)
    rnnz = np.diff(csr.indptr)
    ndata = rnnz.dot(repeats)
    if ndata == 0:
        return sparse.csr_matrix((np.sum(repeats), csr.shape[1]),
                                 dtype=csr.dtype)
    indmap = np.ones(ndata, dtype=np.int)
    indmap[0] = 0
    rnnz_ = np.repeat(rnnz, repeats)
    indptr_ = rnnz_.cumsum()
    mask = indptr_ < ndata
    indmap -= np.int_(np.bincount(indptr_[mask],
                                  weights=rnnz_[mask],
                                  minlength=ndata))
    jumps = (rnnz * repeats).cumsum()
    mask = jumps < ndata
    indmap += np.int_(np.bincount(jumps[mask],
                                  weights=rnnz[mask],
                                  minlength=ndata))
    indmap = indmap.cumsum()
    return sparse.csr_matrix((csr.data[indmap],
                              csr.indices[indmap],
                              np.r_[0, indptr_]),
                             shape=(np.sum(repeats), csr.shape[1]))

and be reasonably efficient, but I'd rather not monkey patch the class. Is there a better way to do this?

Edit

As I revisit this question, I wonder why I posted it in the first place. Almost everything I could think to do with the repeated matrix would be easier to do with the original matrix, and then apply the repetition afterwards. My assumption is that post repetition will always be the better way to approach this problem than any of the potential answers.

from scipy.sparse import csr_matrix
repeated_row_matrix = csr_matrix(np.ones([repeat_number,1])) * sparse_row

scipy.sparse.csr_matrix, with another sparse matrix S (equivalent to S.tocsr()) is the standard CSR representation where the column indices for row i are stored in indices[indptr[i]:​indptr[i+1]] csr_matrix((data, indices, indptr), dtype=int).toarray() array([[2, 1, 0, 0], [0, 1, 1, 1]]) Return indices of maximum elements along an axis. Returns a copy of row i of the matrix, as a (1 x n) CSR matrix (row vector). log1p Element-wise log1p. max ([axis, out]) Return the maximum of the matrix or maximum along an axis. maximum (other) Element-wise maximum between this and another matrix. mean ([axis, dtype, out]) Compute the arithmetic mean along the specified axis. min ([axis, out])

It's not surprising that np.repeat does not work. It delegates the action to the hardcoded a.repeat method, and failing that, first turns a into an array (object if needed).

In the linear algebra world where sparse code was developed, most of the assembly work was done on the row, col, data arrays BEFORE creating the sparse matrix. The focus was on efficient math operations, and not so much on adding/deleting/indexing rows and elements.

I haven't worked through your code, but I'm not surprised that a csr format matrix requires that much work.

I worked out a similar function for the lil format (working from lil.copy):

def lil_repeat(S, repeat):
    # row repeat for lil sparse matrix
    # test for lil type and/or convert
    shape=list(S.shape)
    if isinstance(repeat, int):
        shape[0]=shape[0]*repeat
    else:
        shape[0]=sum(repeat)
    shape = tuple(shape)
    new = sparse.lil_matrix(shape, dtype=S.dtype)
    new.data = S.data.repeat(repeat) # flat repeat
    new.rows = S.rows.repeat(repeat)
    return new

But it is also possible to repeat using indices. Both lil and csr support indexing that is close to that of regular numpy arrays (at least in new enough versions). Thus:

S = sparse.lil_matrix([[0,1,2],[0,0,0],[1,0,0]])
print S.A.repeat([1,2,3], axis=0)
print S.A[(0,1,1,2,2,2),:]
print lil_repeat(S,[1,2,3]).A
print S[(0,1,1,2,2,2),:].A

give the same result

and best of all?

print S[np.arange(3).repeat([1,2,3]),:].A

scipy.sparse.bsr_matrix, BSR is appropriate for sparse matrices with dense sub matrices like the last example below. more efficient than CSR and CSC for many sparse arithmetic operations. 2, 2, 0, 1, 2]) >>> data = np.array([1, 2, 3, 4, 5, 6]).repeat(4).reshape​(6, 2, argmax ([axis, out]), Return indices of minimum elements along an axis. Format of a matrix representation as a string. Maximum number of elements to display when printed. Number of stored values, including explicit zeros. Returns a copy of row i of the matrix, as a (1 x n) CSR matrix (row vector). Element-wise log1p. Return the maximum of the matrix or maximum along an axis.

After someone posted a really clever response for how best to do this I revisited my original question, to see if there was an even better way. I I came up with one more way that has some pros and cons. Instead of repeating all of the data (as is done with the accepted answer), we can instead instruct scipy to reuse the data of the repeated rows, creating something akin to a view of the original sparse array (as you might do with broadcast_to). This can be done by simply tiling the indptr field.

repeated = sparse.csr_matrix((orig.data, orig.indices, np.tile(orig.indptr, repeat_num)))

This technique repeats the vector repeat_num times, while only modifying the the indptr. The downside is that due to the way the csr matrices encode data, instead of creating a matrix that's repeat_num x n in dimension, it creates one that's (2 * repeat_num - 1) x n where every odd row is 0. This shouldn't be too big of a deal as any operation will be quick given that each row is 0, and they should be pretty easy to slice out afterwards (with something like [::2]), but it's not ideal.

I think the marked answer is probably still the "best" way to do this.

scipy.sparse.csr_matrix, Sparse matrices can be used in arithmetic operations: they support addition, subtraction, multiplication, Disadvantages of the CSR format csr_matrix( (3,4)​, dtype=int8 ).todense() matrix([[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]], dtype=int8) max([axis]), Maximum of the elements of this matrix. Last updated on May 11, 2014. scipy.sparse.bsr_matrix format is very similar to the Compressed Sparse Row (CSR) format. Return the maximum of the matrix or maximum along an axis.

scipy.sparse.dia_matrix, with another sparse matrix S (equivalent to S.todia()) data = np.array([[1, 2, 3, 4​]]).repeat(3, axis=0) >>> offsets = np.array([0, -1, 2]) >>> dia_matrix((data, offsets), Compute the arithmetic mean along the specified axis. Sum the matrix elements over a given axis. Parameters axis {-2, -1, 0, 1, None} optional. Axis along which the sum is computed. The default is to compute the sum of all the matrix elements, returning a scalar (i.e. axis = None). dtype dtype, optional. The type of the returned matrix and of the accumulator in which the elements are summed.

scipy.sparse.coo_matrix, COO is a fast format for constructing sparse matrices Constructing a matrix with duplicate indices >>> row = np.array([0, 0, 1, 3, 1, 0, 0]) >>> col = np.array([0, 2, 1, 3, 1, 0, Return indices of maximum elements along an axis. scipy.sparse.csr_matrix.mean¶. Compute the arithmetic mean along the specified axis. Returns the average of the matrix elements. The average is taken over all elements in the matrix by default, otherwise over the specified axis. float64 intermediate and return values are used for integer inputs.

scipy.sparse.bsr_matrix, BSR is appropriate for sparse matrices with dense sub matrices like In such cases, BSR is considerably more efficient than CSR and CSC for many sparse arithmetic operations. 2, 2, 0, 1, 2]) >>> data = np.array([1, 2, 3, 4, 5, 6]).repeat(​4).reshape(6, Return indices of maximum elements along an axis. cupyx.scipy.sparse.spmatrix axis (int or None) – Axis along which the sum is comuted. Convert this matrix to Compressed Sparse Row format.

Comments
  • Generally, answers are much more helpful if they include an explanation of what the code is intended to do, and why that solves the problem without introducing others. Thanks for improving the answer's reference value and making it more understandable!
  • this answer only works when the sparse matrix to be repeated is actually a sparse vector. just using basic linear algebra...
  • @user3357359 if you're repeating a sparse vector, it's seems much quicker to just do something like sparse_row[np.zeros(repeat_number),:]
  • S[np.arange(3).repeat([1,2,3]),:] is genius, and in some quick testing was also much faster than my method.