Quickly convert Pandas Series of labels into Series of indirect values from corresponding columns

I have the following example dataframe:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ref': [ 'a',  'b',  'c',  'd',  'c',  'b',  'a',  'b',  'c'],
    'a':   [   1,    2,    3,    4,    5,    6,    7,    8,    9],
    'b':   [  10,   20,   30,   40,   50,   60,   70,   80,   90],
    'c':   [ 100,  200,  300,  400,  500,  600,  700,  800,  900],
    'd':   [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000],
})

I want to "dereference" the ref column to get this:

    'ref': [ 'a',  'b',  'c',  'd',  'c',  'b',  'a',  'b',  'c'],
    'ind': [   1,   20,  300, 4000,  500,   60,    7,   80,  900],

So each value in ind should be the value from the column named by ref, taken from the same row.

A naïve approach would be something like df[df['ref']], multiplied by an identity matrix and then summed column-wise. But because my dataframe is quite large (~8 GB), I guess this would nearly square its size. And it just doesn't feel right.

Also, due to the size, just iterating over it is painfully slow. And I can't iterate with Cython, because converting the dataframe into a NumPy array loses the label information, which I need to find the right column.

Any suggestions?


You can do it using Series.mask or numpy.where, as shown below. On this dataset numpy.where is the faster of the two.

import numpy as np
import pandas as pd

df_b = pd.DataFrame({
    'ref': [ 'a',  'b',  'c',  'd',  'c',  'b',  'a',  'b',  'c'],
    'a':   [   1,    2,    3,    4,    5,    6,    7,    8,    9],
    'b':   [  10,   20,   30,   40,   50,   60,   70,   80,   90],
    'c':   [ 100,  200,  300,  400,  500,  600,  700,  800,  900],
    'd':   [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000],
})

df_b

Using pandas mask

%%timeit
df = df_b.copy()
cols = df.columns[1:]
df["ind"] = df["ref"]

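for col in cols:
    # Assign the result back instead of using inplace=True on df.ind,
    # which may not update df under copy-on-write
    df["ind"] = df["ind"].mask(df["ref"] == col, df[col])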
df
## 6.73 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Using NumPy's where

%%timeit
df = df_b.copy()
arr = df.ref.values

cols = df.columns[1:]
for col in cols:
    arr2 = df[col].values
    arr = np.where(arr==col, arr2, arr)

df["ind"] = arr
df

## 1.21 ms ± 73 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Result

  ref  a   b    c     d   ind
0   a  1  10  100  1000     1
1   b  2  20  200  2000    20
2   c  3  30  300  3000   300
3   d  4  40  400  4000  4000
4   c  5  50  500  5000   500
5   b  6  60  600  6000    60
6   a  7  70  700  7000     7
7   b  8  80  800  8000    80
8   c  9  90  900  9000   900
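
A possible variation on the np.where loop, as a rough sketch only: since the loop starts from the string array of ref, the resulting ind column ends up with object dtype. If every value column is known to hold integers (an assumption here, as is the hard-coded list value_cols), the result can be pre-allocated as a numeric array and filled one column at a time, which needs only a boolean mask per column as temporary memory.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ref': ['a', 'b', 'c', 'd', 'c', 'b', 'a', 'b', 'c'],
    'a':   [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'b':   [10, 20, 30, 40, 50, 60, 70, 80, 90],
    'c':   [100, 200, 300, 400, 500, 600, 700, 800, 900],
    'd':   [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000],
})

value_cols = ['a', 'b', 'c', 'd']   # assumption: the columns that ref may name
ref = df['ref'].to_numpy()

# Numeric output buffer; rows whose ref names no listed column stay 0
ind = np.zeros(len(df), dtype=np.int64)   # assumption: integer data

for col in value_cols:
    mask = ref == col                      # rows that point at this column
    ind[mask] = df[col].to_numpy()[mask]   # copy only those rows

df['ind'] = ind
print(df[['ref', 'ind']])
```

The number of passes equals the number of value columns, so this suits frames that are long rather than wide.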


You could use numpy indexing:

lookup = dict(zip(df.columns, range(len(df.columns))))
result = pd.DataFrame({ 'ref' : df.ref, 'ind': df.values[np.arange(len(df)), df.ref.map(lookup)] })

print(result)

Output

  ref   ind
0   a     1
1   b    20
2   c   300
3   d  4000
4   c   500
5   b    60
6   a     7
7   b    80
8   c   900
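
One thing to watch with a large frame: df.values on a frame that mixes the string column ref with numeric columns produces a single object-dtype array of the whole frame. A sketch of the same indexing idea that converts only the numeric columns is below. It assumes that ref is the only non-value column; value_cols and col_pos are names made up for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ref': ['a', 'b', 'c', 'd', 'c', 'b', 'a', 'b', 'c'],
    'a':   [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'b':   [10, 20, 30, 40, 50, 60, 70, 80, 90],
    'c':   [100, 200, 300, 400, 500, 600, 700, 800, 900],
    'd':   [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000],
})

# Assumption: every column except 'ref' is a value column
value_cols = df.columns.drop('ref')

# Integer position of each row's label within value_cols (-1 if not found)
col_pos = value_cols.get_indexer(df['ref'])

# Numeric-only array, so the dtype stays int64 instead of object
values = df[value_cols].to_numpy()

result = pd.DataFrame({
    'ref': df['ref'],
    'ind': values[np.arange(len(df)), col_pos],
})
print(result)
```

Because get_indexer returns -1 for labels it cannot find, it is worth checking (col_pos >= 0).all() before indexing; otherwise -1 silently selects the last column.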


Use DataFrame.lookup() (note: deprecated in pandas 1.2.0 and removed in 2.0.0)

df['ind'] = df.lookup(df.index, df['ref'])

  ref  a   b    c     d   ind
0   a  1  10  100  1000     1
1   b  2  20  200  2000    20
2   c  3  30  300  3000   300
3   d  4  40  400  4000  4000
4   c  5  50  500  5000   500
5   b  6  60  600  6000    60
6   a  7  70  700  7000     7
7   b  8  80  800  8000    80
8   c  9  90  900  9000   900
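
On pandas versions where DataFrame.lookup is not available (it was removed in 2.0), a commonly used substitute combines pd.factorize with NumPy indexing. The following is a sketch on the example frame from the question, not a drop-in replacement for every use of lookup:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ref': ['a', 'b', 'c', 'd', 'c', 'b', 'a', 'b', 'c'],
    'a':   [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'b':   [10, 20, 30, 40, 50, 60, 70, 80, 90],
    'c':   [100, 200, 300, 400, 500, 600, 700, 800, 900],
    'd':   [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000],
})

# codes[i] is the position of ref[i] among the unique labels in `labels`
codes, labels = pd.factorize(df['ref'])

# Reorder the columns to match `labels`, then pick one element per row
picked = df.reindex(labels, axis=1).to_numpy()[np.arange(len(df)), codes]

df['ind'] = picked
print(df)
```

Only the columns that actually occur in ref are reindexed and converted, so the string column itself never enters the NumPy array.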
