How to reverse sklearn.OneHotEncoder transform to recover original data?

I encoded my categorical data using sklearn.OneHotEncoder and fed the result to a random forest classifier. Everything seems to work, and I got my predicted output back.

Is there a way to reverse the encoding and convert my output back to its original state?

A good systematic way to figure this out is to start with some test data and work through the sklearn.OneHotEncoder source with it. If you don't much care about how it works and simply want a quick answer, skip to the bottom.

import numpy as np

X = np.array([
    [3, 10, 15, 33, 54, 55, 78, 79, 80, 99],
    [5, 1, 3, 7, 8, 12, 15, 19, 20, 8]
]).T

n_values_

Lines 1763-1786 determine the n_values_ parameter. This will be determined automatically if you set n_values='auto' (the default). Alternatively, you can specify the number of values for all features (int) or the number of values per feature (array). Let's assume that we're using the default. So the following lines execute:

n_samples, n_features = X.shape    # 10, 2
n_values = np.max(X, axis=0) + 1   # [100, 21]
self.n_values_ = n_values

feature_indices_

Next the feature_indices_ parameter is calculated.

n_values = np.hstack([[0], n_values])  # [0, 100, 21]
indices = np.cumsum(n_values)          # [0, 100, 121]
self.feature_indices_ = indices

So feature_indices_ is merely the cumulative sum of n_values_ with a 0 prepended.

Sparse Matrix Construction

Next, a scipy.sparse.coo_matrix is constructed from the data. It is initialized from three arrays: the sparse data (all ones), the row indices, and the column indices.

column_indices = (X + indices[:-1]).ravel()
# array([  3, 105,  10, 101,  15, 103,  33, 107,  54, 108,  55, 112,  78, 115,  79, 119,  80, 120,  99, 108])

row_indices = np.repeat(np.arange(n_samples, dtype=np.int32), n_features)
# array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9], dtype=int32)

data = np.ones(n_samples * n_features)
# array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1., 1.,  1.,  1.,  1.,  1.,  1.,  1.])

out = sparse.coo_matrix((data, (row_indices, column_indices)),
                        shape=(n_samples, indices[-1]),
                        dtype=self.dtype).tocsr()
# <10x121 sparse matrix of type '<type 'numpy.float64'>' with 20 stored elements in Compressed Sparse Row format>

Note that the coo_matrix is immediately converted to a scipy.sparse.csr_matrix. The coo_matrix is used as an intermediate format because it "facilitates fast conversion among sparse formats."

active_features_

Now, if n_values='auto', the sparse csr matrix is compressed down to only the columns with active features. The sparse csr_matrix is returned if sparse=True, otherwise it is densified before returning.

if self.n_values == 'auto':
    mask = np.array(out.sum(axis=0)).ravel() != 0
    active_features = np.where(mask)[0]  # array([  3,  10,  15,  33,  54,  55,  78,  79,  80,  99, 101, 103, 105, 107, 108, 112, 115, 119, 120])
    out = out[:, active_features]  # <10x19 sparse matrix of type '<type 'numpy.float64'>' with 20 stored elements in Compressed Sparse Row format>
    self.active_features_ = active_features

return out if self.sparse else out.toarray()

Decoding

Now let's work in reverse. We'd like to know how to recover X given the sparse matrix that is returned along with the OneHotEncoder features detailed above. Let's assume we actually ran the code above by instantiating a new OneHotEncoder and running fit_transform on our data X.

from sklearn import preprocessing
ohc = preprocessing.OneHotEncoder()  # all default params
out = ohc.fit_transform(X)

The key insight to solving this problem is understanding the relationship between active_features_ and out.indices. For a csr_matrix, the indices array contains the column numbers for each data point. However, these column numbers are not guaranteed to be sorted. To sort them, we can use the sorted_indices method.

out.indices  # array([12,  0, 10,  1, 11,  2, 13,  3, 14,  4, 15,  5, 16,  6, 17,  7, 18, 8, 14,  9], dtype=int32)
out = out.sorted_indices()
out.indices  # array([ 0, 12,  1, 10,  2, 11,  3, 13,  4, 14,  5, 15,  6, 16,  7, 17,  8, 18,  9, 14], dtype=int32)

We can see that before sorting, the indices within each row are actually reversed: they are ordered with the last column first and the first column last. This is evident from the first two elements: [12, 0]. 0 corresponds to the 3 in the first column of X; since 3 is the minimum element, it was assigned to the first active column. 12 corresponds to the 5 in the second column of X. Since the first column of X occupies 10 distinct active columns, the minimum element of the second column (1) gets index 10, the next smallest (3) gets index 11, and the third smallest (5) gets index 12. After sorting, the indices are ordered as we would expect.
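
A quick sanity check of that claim (a sketch; the per-row ordering is an implementation detail of this scipy/sklearn version, verified here only on this example):

fresh = ohc.fit_transform(X)  # a fresh copy with unsorted indices
for i in range(fresh.shape[0]):
    row_cols = fresh.indices[fresh.indptr[i]:fresh.indptr[i + 1]]
    assert np.array_equal(row_cols, np.sort(row_cols)[::-1])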

Next we look at active_features_:

ohc.active_features_  # array([  3,  10,  15,  33,  54,  55,  78,  79,  80,  99, 101, 103, 105, 107, 108, 112, 115, 119, 120])

Notice that there are 19 elements, which corresponds to the number of distinct elements in our data (one element, 8, was repeated). Notice also that they are arranged in order: the features that were in the first column of X are unchanged, and the features in the second column have simply been offset by 100, which corresponds to ohc.feature_indices_[1].
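
A quick way to verify that (a sketch using the ohc fitted above):

first = np.unique(X[:, 0])                             # column-0 categories, unchanged
second = np.unique(X[:, 1]) + ohc.feature_indices_[1]  # column-1 categories, offset by 100
assert np.array_equal(ohc.active_features_, np.concatenate([first, second]))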

Looking back at out.indices, we can see that the maximum column number is 18, which is one less than the 19 active features in our encoding. A little thought about the relationship here shows that the indices of ohc.active_features_ correspond to the column numbers in out.indices. With this, we can decode:

import numpy as np
decode_columns = np.vectorize(lambda col: ohc.active_features_[col])
decoded = decode_columns(out.indices).reshape(X.shape)
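
(As an aside, np.vectorize is not strictly needed here: active_features_ is itself a NumPy array, so plain fancy indexing is an equivalent sketch of the same step.)

decoded = ohc.active_features_[out.indices].reshape(X.shape)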

This gives us:

array([[  3, 105],
       [ 10, 101],
       [ 15, 103],
       [ 33, 107],
       [ 54, 108],
       [ 55, 112],
       [ 78, 115],
       [ 79, 119],
       [ 80, 120],
       [ 99, 108]])

And we can get back to the original feature values by subtracting off the offsets from ohc.feature_indices_:

recovered_X = decoded - ohc.feature_indices_[:-1]
recovered_X
array([[ 3,  5],
       [10,  1],
       [15,  3],
       [33,  7],
       [54,  8],
       [55, 12],
       [78, 15],
       [79, 19],
       [80, 20],
       [99,  8]])

Note that you will need to have the original shape of X, which is simply (n_samples, n_features).

TL;DR

Given the sklearn.OneHotEncoder instance called ohc, the encoded data (scipy.sparse.csr_matrix) output from ohc.fit_transform or ohc.transform called out, and the shape of the original data (n_samples, n_features), recover the original data X with:

recovered_X = (np.array([ohc.active_features_[col]
                         for col in out.sorted_indices().indices])
               .reshape(n_samples, n_features) - ohc.feature_indices_[:-1])
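
Putting it together, here is a minimal end-to-end sketch of the recipe (it re-uses the example X from above and assumes a scikit-learn version that still exposes active_features_ and feature_indices_):

import numpy as np
from sklearn import preprocessing

X = np.array([
    [3, 10, 15, 33, 54, 55, 78, 79, 80, 99],
    [5, 1, 3, 7, 8, 12, 15, 19, 20, 8]
]).T
n_samples, n_features = X.shape

ohc = preprocessing.OneHotEncoder()
out = ohc.fit_transform(X)

recovered_X = (np.array([ohc.active_features_[col]
                         for col in out.sorted_indices().indices])
               .reshape(n_samples, n_features) - ohc.feature_indices_[:-1])

assert np.array_equal(recovered_X, X)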

Just compute the dot-product of the encoded values with ohe.active_features_. It works for both sparse and dense representations. Example:

from sklearn.preprocessing import OneHotEncoder
import numpy as np

orig = np.array([6, 9, 8, 2, 5, 4, 5, 3, 3, 6])

ohe = OneHotEncoder()
encoded = ohe.fit_transform(orig.reshape(-1, 1)) # input needs to be column-wise

decoded = encoded.dot(ohe.active_features_).astype(int)
assert np.allclose(orig, decoded)

The key insight is that the active_features_ attribute of the OHE model represents the original values for each binary column. Thus we can decode the one-hot-encoded number by simply computing a dot-product with active_features_. For each data point there is just a single 1, at the position of the original value.
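
As noted in the comments below, this dot-product trick as written only handles a single encoded column. One possible extension (a sketch of my own, not from the original answer) slices the encoded matrix per feature using feature_indices_ and decodes each slice separately:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([[6, 9, 8, 2, 5], [1, 3, 2, 0, 1]]).T  # two categorical columns

ohe = OneHotEncoder()
out = ohe.fit_transform(X)

cols = []
for j in range(X.shape[1]):
    lo, hi = ohe.feature_indices_[j], ohe.feature_indices_[j + 1]
    # encoded columns whose global category index belongs to feature j
    mask = (ohe.active_features_ >= lo) & (ohe.active_features_ < hi)
    col = out[:, np.where(mask)[0]].dot(ohe.active_features_[mask]) - lo
    cols.append(np.asarray(col).ravel().astype(int))

decoded = np.column_stack(cols)
assert np.array_equal(decoded, X)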

The short answer is "no". The encoder takes your categorical data and automagically transforms it to a reasonable set of numbers.

The longer answer is "not automatically". If you provide an explicit mapping using the n_values parameter, though, you can probably implement your own decoding on the other side. See the documentation for some hints on how that might be done.

That said, this is a fairly strange question. You may want to use a DictVectorizer instead.
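
For reference, a minimal DictVectorizer sketch (the data here is made up for illustration): its inverse_transform gives you decoding for free, although for string features it returns 'feature=value' style names rather than the original dicts:

from sklearn.feature_extraction import DictVectorizer

data = [{'color': 'red', 'size': 'S'},
        {'color': 'blue', 'size': 'M'}]

dv = DictVectorizer(sparse=False)
encoded = dv.fit_transform(data)
# columns are ['color=blue', 'color=red', 'size=M', 'size=S']

decoded = dv.inverse_transform(encoded)
# [{'color=red': 1.0, 'size=S': 1.0}, {'color=blue': 1.0, 'size=M': 1.0}]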

If the features are dense, like [1, 2, 4, 5, 6] with only a few values missing, we can map them to their corresponding positions.

>>> import numpy as np
>>> from scipy import sparse
>>> def _sparse_binary(y):
...     # one-hot codes of y with scipy.sparse matrix.
...     row = np.arange(len(y))
...     col = y - y.min()
...     data = np.ones(len(y))
...     return sparse.csr_matrix((data, (row, col)))
... 
>>> y = np.random.randint(-2,2, 8).reshape([4,2])
>>> y
array([[ 0, -2],
       [-2,  1],
       [ 1,  0],
       [ 0, -2]])
>>> yc = [_sparse_binary(y[:, i]) for i in range(2)]
>>> for i in yc: print(i.todense())
... 
[[ 0.  0.  1.  0.]
 [ 1.  0.  0.  0.]
 [ 0.  0.  0.  1.]
 [ 0.  0.  1.  0.]]
[[ 1.  0.  0.  0.]
 [ 0.  0.  0.  1.]
 [ 0.  0.  1.  0.]
 [ 1.  0.  0.  0.]]
>>> [i.shape for i in yc]
[(4, 4), (4, 4)]

This is a simple, somewhat compromised method, but it works and is easy to reverse with argmax(), e.g.:

>>> np.argmax(yc[0].todense(), 1) + y.min(0)[0]
matrix([[ 0],
        [-2],
        [ 1],
        [ 0]])

How to one-hot encode

See https://stackoverflow.com/a/42874726/562769

import numpy as np
nb_classes = 6
orig_data = [[2, 3, 4, 0]]

def indices_to_one_hot(data, nb_classes):
    """Convert an iterable of indices to one-hot encoded labels."""
    targets = np.array(data).reshape(-1)
    return np.eye(nb_classes)[targets]

How to reverse

def one_hot_to_indices(data):
    indices = []
    for el in data:
        indices.append(list(el).index(1))
    return indices


hot = indices_to_one_hot(orig_data, nb_classes)
indices = one_hot_to_indices(hot)

print(orig_data)
print(indices)

gives:

[[2, 3, 4, 0]]
[2, 3, 4, 0]
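
Equivalently (an alternative sketch, not part of the linked answer), np.argmax collapses the one-hot rows back to indices in a single call:

hot = indices_to_one_hot(orig_data, nb_classes)
indices = np.argmax(hot, axis=1)  # array([2, 3, 4, 0])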

Comments
  • TBH I skipped to the TL;DR. However, I found it would not work for me unless I used "out.sort_indices().indices" instead of merely "out.indices". Otherwise, I needed to switch the order of my two columns before subtracting "ohc.feature_indices_[:-1]"
  • Quite right! I included that in the longer answer but left it out of the TL;DR. I've edited to fix this.
  • @Mack Great answer, thank you! Now, what about when we pass the OneHotEncoded X to a predictive model (logistic regression, SVM etc.). How do we map the model's coefficients back to X? I want to be able to say, "variable foo increases the target by bar_coeff" but I don't understand how to map the model's coefficients back to the original category X. Here is the full-blown question posed by another user on SO: stackoverflow.com/questions/40141710/…
  • @Mack and here is my question on it: stackoverflow.com/questions/45041387/…
  • @Phyreese, you can select this as the answer
  • This approach doesn't work for me when orig is a multi-dimensional array (e.g. orig = np.array([[6, 9, 8, 2, 5, 4, 5, 3, 3, 6],[6, 9, 8, 2, 5, 4, 5, 3, 3, 6]]))
  • I feel like I have the same lack of understanding. Why is this a strange question? Without decoding I wouldn't be able to tell what factor coded into 0,1 is paired with what coefficient
  • the onehotencoding implements the vanilla one-of-k algorithm - which optimizes performance by not using a fixed ordering for parameters. this means the algorithm doesn't guarantee the same encoding on multiple runs, and is not reversible. i'm not sure of your use case - if you're looking to do decoding, you're most likely using the wrong algorithm implementation - look at DictVectorizer, or extend the default with a mapping and a custom decoder.
  • While it is true that the algorithm does not guarantee the same encoding on multiple runs, it is false that it is not reversible. It is actually quite easily reversible. Please see my answer for the procedure and a thorough explanation.
  • @Mack have you read your answer and explanation? we have different definitions of easy i think ;)
  • I suppose we do. The TL;DR isn't so bad though. : )
  • This is not an answer to this question!