Reshaping and encoding multi-column categorical variables to one hot encoding


I have some data which looks as follows:

    Owner   Label1  Label2  Label3      
    Bob     Dog     N/A     N/A 
    John    Cat     Mouse   N/A 
    Lee     Dog     Cat     N/A
    Jane    Hamster Rat     Ferret

And I want it reshaped to one-hot encoding. Something like this:

    Owner   Dog     Cat     Mouse    Hamster    Rat    Ferret   
    Bob     1       0       0        0          0      0
    John    0       1       1        0          0      0    
    Lee     1       1       0        0          0      0
    Jane    0       0       0        1          1      1

I've looked around the documentation and Stack Overflow, but haven't been able to determine the relevant functions to achieve this. get_dummies comes pretty close, but it prefixes each dummy with its source column, so the same category appearing in different Label columns ends up as separate output columns.
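For reference, the snippets in the answers below assume the table lives in a DataFrame like the following (an assumption about the storage; the N/A cells are real missing values):

```python
import numpy as np
import pandas as pd

# Reconstruct the example data; N/A cells become NaN
df = pd.DataFrame({
    'Owner':  ['Bob', 'John', 'Lee', 'Jane'],
    'Label1': ['Dog', 'Cat', 'Dog', 'Hamster'],
    'Label2': [np.nan, 'Mouse', 'Cat', 'Rat'],
    'Label3': [np.nan, np.nan, np.nan, 'Ferret'],
})
```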

Using str.get_dummies on the stacked column (note: Series.sum with a level argument was removed in pandas 2.0, so group by the index level instead; sort=False keeps the original owner order):

df.set_index('Owner').stack().str.get_dummies().groupby(level=0, sort=False).sum()
Out[535]: 
       Cat  Dog  Ferret  Hamster  Mouse  Rat
Owner                                       
Bob      0    1       0        0      0    0
John     1    0       0        0      1    0
Lee      1    1       0        0      0    0
Jane     0    0       1        1      0    1

Or

s=df.melt('Owner')
pd.crosstab(s.Owner,s.value)
Out[540]: 
value  Cat  Dog  Ferret  Hamster  Mouse  Rat
Owner                                       
Bob      0    1       0        0      0    0
Jane     0    0       1        1      0    1
John     1    0       0        0      1    0
Lee      1    1       0        0      0    0
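Note that crosstab sorts the owners alphabetically. If the original row order matters, one option (a sketch; the reindex step is an addition, not part of the answer itself) is to reindex on the original Owner column afterwards:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Owner':  ['Bob', 'John', 'Lee', 'Jane'],
    'Label1': ['Dog', 'Cat', 'Dog', 'Hamster'],
    'Label2': [np.nan, 'Mouse', 'Cat', 'Rat'],
    'Label3': [np.nan, np.nan, np.nan, 'Ferret'],
})

s = df.melt('Owner')                  # NaN values survive the melt...
tab = pd.crosstab(s.Owner, s.value)   # ...but crosstab drops them
tab = tab.reindex(df['Owner'])        # restore the original row order
```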


You could use get_dummies on the stacked dataset, then groupby and sum:

pd.get_dummies(df.set_index('Owner').stack()).groupby('Owner', sort=False).sum()

       Cat  Dog  Ferret  Hamster  Mouse  Rat
Owner                                       
Bob      0    1       0        0      0    0
John     1    0       0        0      1    0
Lee      1    1       0        0      0    0
Jane     0    0       1        1      0    1


sklearn.preprocessing.MultiLabelBinarizer
from sklearn.preprocessing import MultiLabelBinarizer

# iterate rows: pull out the owner and the remaining non-null labels
o, l = zip(*[[o, [*filter(pd.notna, l)]] for o, *l in zip(*map(df.get, df))])

mlb = MultiLabelBinarizer()

d = mlb.fit_transform(l)
pd.DataFrame(d, o, mlb.classes_)

      Cat  Dog  Ferret  Hamster  Mouse  Rat
Bob     0    1       0        0      0    0
John    1    0       0        0      1    0
Lee     1    1       0        0      0    0
Jane    0    0       1        1      0    1

Same-ish answer
o = df.Owner
l = [[x for x in l if pd.notna(x)] for l in df.filter(like='Label').values]

mlb = MultiLabelBinarizer()

d = mlb.fit_transform(l)
pd.DataFrame(d, o, mlb.classes_)

       Cat  Dog  Ferret  Hamster  Mouse  Rat
Owner                                       
Bob      0    1       0        0      0    0
John     1    0       0        0      1    0
Lee      1    1       0        0      0    0
Jane     0    0       1        1      0    1
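A side benefit of MultiLabelBinarizer is that the encoding is reversible via inverse_transform. A minimal sketch, assuming the same toy data:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({
    'Owner':  ['Bob', 'John', 'Lee', 'Jane'],
    'Label1': ['Dog', 'Cat', 'Dog', 'Hamster'],
    'Label2': [np.nan, 'Mouse', 'Cat', 'Rat'],
    'Label3': [np.nan, np.nan, np.nan, 'Ferret'],
})

# one list of labels per owner, NaN dropped
labels = [[x for x in row if pd.notna(x)]
          for row in df.filter(like='Label').values]

mlb = MultiLabelBinarizer()
d = mlb.fit_transform(labels)

# recover the label sets from the binary matrix
recovered = mlb.inverse_transform(d)
```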


The pandas.get_dummies function converts categorical variables into dummy/indicator variables in a single step.
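One way to get there with get_dummies alone (a sketch; the empty-prefix trick and the transpose/groupby collapse are one possible route, not the canonical one): suppress the prefixes so the same animal coming from different Label columns produces duplicate column names, then collapse the duplicates with a column-wise max:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Owner':  ['Bob', 'John', 'Lee', 'Jane'],
    'Label1': ['Dog', 'Cat', 'Dog', 'Hamster'],
    'Label2': [np.nan, 'Mouse', 'Cat', 'Rat'],
    'Label3': [np.nan, np.nan, np.nan, 'Ferret'],
})

# empty prefix -> columns named after the category values themselves,
# possibly duplicated across Label1/Label2/Label3
d = pd.get_dummies(df.set_index('Owner'), prefix='', prefix_sep='')

# collapse duplicate column names: transpose, group by name, take max
out = d.T.groupby(level=0).max().T.astype(int)
```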

In your specific application, you'll have to provide a list of columns that are categorical, or infer which columns are categorical. In the best case your DataFrame already has these columns with dtype=category, and you can pass columns=df.columns[df.dtypes == 'category'] to get_dummies.

Comments
  • df.set_index('Owner').stack().groupby(level=0).apply('|'.join).str.get_dummies()
  • Both these methods are reducing the number of rows I have in my dataset? Going from 3159 to 2599. Any clue why?
  • @user3297011 You probably have duplicated Owner values; these methods aggregate by Owner, so duplicates collapse into one row.
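On the shrinking row count raised in the comments: every approach here aggregates by Owner, so duplicated owners collapse into a single row, and sum can then produce counts above 1. A sketch with hypothetical duplicated data, using max instead of sum to keep the result strictly binary:

```python
import numpy as np
import pandas as pd

# hypothetical data with a duplicated owner
df = pd.DataFrame({
    'Owner':  ['Bob', 'Bob', 'Jane'],
    'Label1': ['Dog', 'Cat', 'Rat'],
    'Label2': [np.nan, 'Mouse', np.nan],
})

# max (not sum) keeps the result binary when an owner repeats
out = (pd.get_dummies(df.set_index('Owner').stack())
         .groupby(level=0).max()
         .astype(int))
```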