Convert numpy rows into columns based on ID

Suppose I have a numpy array that maps between IDs of two item types:

[[1, 12],
 [1, 13],
 [1, 14],
 [2, 13],
 [2, 14],
 [3, 11]]

I would like to rearrange this array so that each row in the new array holds all of the values that mapped to the same ID in the original array, one mapping per column, up to a specified limit on the number of columns in the new array. Applying this to the array above with a limit of 2 columns would give:

[[12, 13],  #Represents 1 - 14 was not kept as only 2 columns are allowed
 [13, 14],  #Represents 2
 [11,  0]]  #Represents 3 - 0 was used as padding since 3 did not have 2 mappings

The naïve approach here would be to use a for-loop that populates the new array as it encounters rows in the original array. Is there a more efficient means of accomplishing this with numpy's functionality?
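
For reference, a minimal sketch of such a loop (one possible baseline, assuming IDs are grouped in order of first appearance and short groups are padded with 0):

import numpy as np

def naive(map_, maxitems=2):
    # Collect up to maxitems values per ID, in order of appearance.
    groups = {}
    for key, val in map_:
        groups.setdefault(key, [])
        if len(groups[key]) < maxitems:
            groups[key].append(val)
    # Pad short groups with 0 and stack into a 2-D array.
    return np.array([g + [0] * (maxitems - len(g)) for g in groups.values()])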

Here is an approach using a sparse matrix:

def pp(map_, maxitems=2):
    # One CSR entry per input row: row i holds the value map_[i, 1]
    # in column map_[i, 0] (the ID).
    M = sparse.csr_matrix((map_[:, 1], map_[:, 0], np.arange(map_.shape[0]+1)))
    # Converting to CSC groups each ID's values contiguously in M.data.
    M = M.tocsc()
    sizes = np.diff(M.indptr)   # number of values per ID
    ids, = np.where(sizes)      # IDs that actually occur
    # Pad, then build a strided view of every length-maxitems window.
    D = np.concatenate([M.data, np.zeros((maxitems - 1,), dtype=M.data.dtype)])
    D = np.lib.stride_tricks.as_strided(D, (D.size - maxitems + 1, maxitems),
                                        2 * D.strides)
    # Take the window starting at each ID's first value, then zero out
    # slots beyond that ID's actual count.
    result = D[M.indptr[ids]]
    result[np.arange(maxitems) >= sizes[ids, None]] = 0
    return result
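
A quick check on the array from the question (assuming numpy and scipy.sparse are imported as in the full script below) reproduces the expected output:

a = np.array([[1, 12], [1, 13], [1, 14], [2, 13], [2, 14], [3, 11]])
print(pp(a))
# [[12 13]
#  [13 14]
#  [11  0]]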

Timings using @chrisz's code, but modified to use less repetitive test data. I also added a bit of "validation": chrisz's solution and mine give the same answer; the other two output a different format, so I couldn't check them.

Code:

from scipy import sparse
import numpy as np
from collections import defaultdict, deque

def pp(map_, maxitems=2):
    M = sparse.csr_matrix((map_[:, 1], map_[:, 0], np.arange(map_.shape[0]+1)))
    M = M.tocsc()
    sizes = np.diff(M.indptr)
    ids, = np.where(sizes)
    D = np.concatenate([M.data, np.zeros((maxitems - 1,), dtype=M.data.dtype)])
    D = np.lib.stride_tricks.as_strided(D, (D.size - maxitems + 1, maxitems),
                                        2 * D.strides)
    result = D[M.indptr[ids]]
    result[np.arange(maxitems) >= sizes[ids, None]] = 0
    return result

def chrisz(a):
  return [[*a[a[:,0]==i,1],0][:2] for i in np.unique(a[:,0])]

def piotr(a):
  d = defaultdict(lambda: deque((0, 0), maxlen=2))
  for key, val in a:
    d[key].append(val)
  return d

def karams(arr):
  cols = arr.shape[1]
  ids = arr[:, 0]
  inds = np.where(np.diff(ids) != 0)[0] + 1
  sp = np.split(arr[:,1:], inds)
  result = [a[:2].ravel() if a.size >= cols else
            np.pad(a.ravel(), (0, cols - a.size), 'constant') for a in sp]
  return result

def make(nid, ntot):
    # ntot random (id, value) rows, with ids drawn from range(nid).
    return np.c_[np.random.randint(0, nid, (ntot,)),
                 np.random.randint(0, 2**30, (ntot,))]

from timeit import timeit
import pandas as pd
import matplotlib.pyplot as plt

res = pd.DataFrame(
       index=['pp', 'chrisz', 'piotr', 'karams'],
       columns=[10, 50, 100, 500, 1000, 5000, 10000],# 50000],
       dtype=float
)

for c in res.columns:
#        l = np.repeat(np.array([[1, 12],[1, 13],[1, 14],[2, 13],[2, 14],[3, 11]]), c, axis=0)
    l = make(c // 2, c * 6)
    assert np.all(chrisz(l) == pp(l))
    for f in res.index:
        stmt = '{}(l)'.format(f)
        setp = 'from __main__ import l, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=30)

ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N");
ax.set_ylabel("time (relative)");

plt.show()

Here is a general and mostly Numpythonic approach:

In [144]: def array_packer(arr):
     ...:     cols = arr.shape[1]
     ...:     ids = arr[:, 0]
     ...:     inds = np.where(np.diff(ids) != 0)[0] + 1
     ...:     sp = np.split(arr[:,1:], inds)
     ...:     result = [np.unique(a[: cols]) if a.shape[0] >= cols else
     ...:                    np.pad(np.unique(a), (0, (cols - 1) * (cols - a.shape[0])), 'constant')
     ...:                 for a in sp]
     ...:     return result

Demo:

In [145]: a = np.array([[1, 12, 15, 45],
     ...:  [1, 13, 23, 9],
     ...:  [1, 14, 14, 11],
     ...:  [2, 13, 90, 34],
     ...:  [2, 14, 23, 43],
     ...:  [3, 11, 123, 53]])

In [146]: array_packer(a)
Out[146]: 
[array([ 9, 11, 12, 13, 14, 15, 23, 45,  0,  0,  0]),
 array([13, 14, 23, 34, 43, 90,  0,  0,  0,  0,  0,  0]),
 array([ 11,  53, 123,   0,   0,   0,   0,   0,   0,   0,   0,   0])]

In [147]: a = np.array([[1, 12, 15],
     ...:  [1, 13, 23],
     ...:  [1, 14, 14],
     ...:  [2, 13, 90],
     ...:  [2, 14, 23],
     ...:  [3, 11, 123]])

In [148]: array_packer(a)
Out[148]: 
[array([12, 13, 14, 15, 23]),
 array([13, 14, 23, 90,  0,  0]),
 array([ 11, 123,   0,   0,   0,   0])]
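
For completeness, on the two-column array from the question this should reproduce the expected rows (note that np.unique sorts each group, which happens to match the input order here):

In [149]: a = np.array([[1, 12], [1, 13], [1, 14], [2, 13], [2, 14], [3, 11]])

In [150]: array_packer(a)
Out[150]: 
[array([12, 13]), array([13, 14]), array([11,  0])]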

For this problem the naive for-loop is actually quite an efficient solution:

from collections import defaultdict, deque
d = defaultdict(lambda: deque((0, 0), maxlen=2))

%%timeit
for key, val in a:
    d[key].append(val)
4.43 µs ± 29.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

# result: {1: deque([13, 14]), 2: deque([13, 14]), 3: deque([0, 11])}
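
If the array form from the question is needed, the dict of deques can be stacked afterwards (a small follow-up sketch; note that each deque keeps the last two values per ID and pads on the left, which differs from the question's expected output, as the comments below point out):

import numpy as np
result = np.array([list(d[k]) for k in sorted(d)])
# array([[13, 14],
#        [13, 14],
#        [ 0, 11]])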

For comparison, a numpy solution proposed in this thread is 4 times slower:

%timeit [[*a[a[:,0]==i,1],0][:2] for i in np.unique(a[:,0])]
18.6 µs ± 336 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

NumPy is great and I use it a lot myself, but I think in this case it is cumbersome.

Slightly adapted from the almost-duplicate question to pad and select only two elements:

[[*a[a[:,0]==i,1],0][:2] for i in np.unique(a[:,0])]

Output:

[[12, 13], [13, 14], [11, 0]]

If you want to keep track of keys:

{i:[*a[a[:,0]==i,1],0][:2] for i in np.unique(a[:,0])}

# {1: [12, 13], 2: [13, 14], 3: [11, 0]}
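
For readers unpacking the one-liner, an equivalent explicit form (same logic, just spelled out):

result = []
for i in np.unique(a[:, 0]):      # each distinct ID, in sorted order
    vals = a[a[:, 0] == i, 1]     # all values mapped to this ID
    padded = [*vals, 0]           # append one 0 so one-element groups pad out
    result.append(padded[:2])     # keep at most two columns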

Functions

def chrisz(a):
  return [[*a[a[:,0]==i,1],0][:2] for i in np.unique(a[:,0])]

def piotr(a):
  d = defaultdict(lambda: deque((0, 0), maxlen=2))
  for key, val in a:
    d[key].append(val)
  return d

def karams(arr):
  cols = arr.shape[1]
  ids = arr[:, 0]
  inds = np.where(np.diff(ids) != 0)[0] + 1
  sp = np.split(arr[:,1:], inds)
  result = [a[:2].ravel() if a.size >= cols else
            np.pad(a.ravel(), (0, cols - a.size), 'constant') for a in sp]
  return result

Timings

from timeit import timeit
import pandas as pd
import matplotlib.pyplot as plt

res = pd.DataFrame(
       index=['chrisz', 'piotr', 'karams'],
       columns=[10, 50, 100, 500, 1000, 5000, 10000, 50000],
       dtype=float
)

for f in res.index:
    for c in res.columns:
        l = np.repeat(np.array([[1, 12],[1, 13],[1, 14],[2, 13],[2, 14],[3, 11]]), c, axis=0)
        stmt = '{}(l)'.format(f)
        setp = 'from __main__ import l, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=30)

ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N");
ax.set_ylabel("time (relative)");

plt.show()

Results (Clearly @Kasramvd is the winner):

Comments
  • Very similar question to: stackoverflow.com/questions/38013778/… but not an exact duplicate.
  • I had a suspicion that my timings only looked good on the repetitive test data, since it really only looped three times there. Nice approach!
  • Nice algorithmic vision!
  • Regarding your benchmarks please run the benchmarks with large arrays as well. Also note that you need to pad the arrays with zero instead of None. Plus padding should start from right.
  • I changed padding from None to 0. Thanks for pointing that out @Kasramvd!
  • I just timed this on a list 10000 times as large and the numpy approach was much faster, I'd appreciate someone else to validate the timings so it doesn't seem biased.