sklearn - how to incorporate missing data when one-hot encoding

I'm trying to keep rows in a dataset that contain missing data.

When one-hot encoding a column (or multiple columns) with sklearn, is it possible to write a rule so that if currentItem == null or currentItem == 0, the output array is set to all 0s? For example:


A A B -> [[1, 0], [1, 0], [0,1]]

B B A -> [[0, 1], [0, 1], [1,0]]

null B A -> [[0, 0], [0, 1], [1,0]]
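One way to get exactly that behaviour is sklearn's OneHotEncoder with handle_unknown='ignore': any value the encoder was not fitted on comes out as an all-zero row. A minimal sketch (the '__missing__' placeholder string is a made-up name, not part of the API):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fit only on the real categories; handle_unknown='ignore' makes any
# value the encoder has never seen encode as an all-zero row.
enc = OneHotEncoder(categories=[['A', 'B']], handle_unknown='ignore')
enc.fit([['A'], ['B']])

# Map missing entries to a placeholder the encoder was not fitted on.
raw = [None, 'B', 'A']
rows = [[v if v is not None else '__missing__'] for v in raw]

print(enc.transform(rows).toarray())
# [[0. 0.]
#  [0. 1.]
#  [1. 0.]]
```

The same trick works for 0: replace it with the placeholder before transforming.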

My current one-hot encoding code:

import numpy as np
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical  # to_categorical comes from Keras

# dtype=str so loadtxt can read categorical values like 'A'/'B'
dataset = np.loadtxt("someFile.csv", delimiter=",", dtype=str)
B = dataset[:, 1]

encoder = LabelEncoder()
encoded_B = encoder.fit_transform(B)  # the encoder must be fitted before transforming

Y = to_categorical(encoded_B)

EDIT - Example dataset, where A-E are inputs and X & Y are outputs:

A     B     C     D     E     X      Y
7     6     3     3     2     11     4
5     6     0     0     7     15     7
3     3     9     null  7     12     7
7     null  7     null  7     12     13
null  7     4     6     12    13     4
null  5     7     6     null  14     7
2     6     0     0     2     13     3
7     null  7     null  2     13     7
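A table like this can be loaded with pandas so that the literal string 'null' becomes NaN. A sketch (inlining a slice of the data rather than reading someFile.csv):

```python
import io
import pandas as pd

csv_text = """A,B,C,D,E,X,Y
7,6,3,3,2,11,4
3,3,9,null,7,12,7
null,7,4,6,12,13,4"""

# na_values turns the literal string 'null' into NaN
df = pd.read_csv(io.StringIO(csv_text), na_values=["null"])
print(int(df.isna().sum().sum()))  # 2 missing values in this slice
```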

If you have pandas, this is pretty simple.

import numpy as np
import pandas as pd

s = pd.Series(['A', 'A', 0, 'B', 0, 'A', np.nan])

0      A
1      A
2      0
3      B
4      0
5      A
6    NaN
dtype: object

Use replace to convert 0 to NaN -

s = s.replace({0 : np.nan, '0' : np.nan})

0      A
1      A
2    NaN
3      B
4    NaN
5      A
6    NaN
dtype: object

Now, call pd.get_dummies, which ignores NaN values -

pd.get_dummies(s)

   A  B
0  1  0
1  1  0
2  0  0
3  0  1
4  0  0
5  1  0
6  0  0

The solution is the same for a dataframe.
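For instance, a sketch with a hypothetical two-column frame (dtype=int just keeps the output as 0/1 rather than booleans):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['A', 0, 'B', np.nan],
                   'col2': ['B', 'B', np.nan, 'A']})

df = df.replace({0: np.nan, '0': np.nan})
dummies = pd.get_dummies(df, dtype=int)  # NaN rows become all zeros
print(dummies)
#    col1_A  col1_B  col2_A  col2_B
# 0       1       0       0       1
# 1       0       0       0       1
# 2       0       1       0       0
# 3       0       0       1       0
```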

Or, you can try the pandas fillna() method. Let's say you have a DataFrame called df. Then, you can do:

df = df.fillna(0)

to convert all NaN in df into zeros, before passing it through one-hot encoding.
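A minimal sketch of that (with a hypothetical one-column frame):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'C': [3.0, 0.0, 9.0, np.nan]})
df = df.fillna(0)                # every NaN becomes 0
print(df['C'].tolist())          # [3.0, 0.0, 9.0, 0.0]
```

Note that if the column is then one-hot encoded, 0 becomes a category of its own rather than an all-zero row.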

I would suggest replacing nan values with 'None', which will introduce an additional column, i.e.

df_encoding_variables = df_encoding_variables.replace(np.nan, 'None')
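A sketch of what that produces with get_dummies (hypothetical series; note the extra 'None' column):

```python
import numpy as np
import pandas as pd

s = pd.Series(['A', 'A', np.nan, 'B'])
s = s.replace(np.nan, 'None')            # missing becomes its own category
dummies = pd.get_dummies(s, dtype=int)
print(dummies)
#    A  B  None
# 0  1  0     0
# 1  1  0     0
# 2  0  0     1
# 3  0  1     0
```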

  • Do you have pandas?
  • I do, is there a solution using pandas?
  • What exactly is dataset? An array or something else? Add a sample one?
  • Would you say one-hot encoding is better using pandas rather than sklearn then?
  • @JoeBoggs They have slightly disjoint use cases. Sklearn's Label Encoder is useful when used as part of a larger pipeline. Meanwhile, get_dummies is useful for cases such as yours. It also makes it easy to generate a sparse array of encodings, which I don't believe sklearn does. But in general, they do the same thing.
  • @JoeBoggs To elaborate on the above comment, LabelEncoder (in combination with OneHotEncoder) is useful when the data is split into train and test and the test set does not contain all the categories present in train. In that case pd.get_dummies will generate different numbers of columns for train and test, which may produce errors further down the pipeline because the columns have changed. LabelEncoder will preserve the columns. But when using the data as a whole, pd.get_dummies is the better choice.
  • @VivekKumar Spot on description, thanks for leaving a comment.
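To illustrate the train/test point from the comments (a sketch; the column name is made up):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'cat': ['A', 'B', 'C']})
test = pd.DataFrame({'cat': ['A', 'A']})     # 'B' and 'C' never appear

# get_dummies encodes each frame independently -> column counts differ
print(pd.get_dummies(train['cat']).shape)    # (3, 3)
print(pd.get_dummies(test['cat']).shape)     # (2, 1)

# An encoder fitted on train keeps the same columns for test
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train[['cat']])
print(enc.transform(test[['cat']]).shape)    # (2, 3)
```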