sklearn.LabelEncoder with never seen before values


If a sklearn.LabelEncoder has been fitted on a training set, it might break if it encounters new values when used on a test set.

The only solution I could come up with for this is to map everything new in the test set (i.e. not belonging to any existing class) to "<unknown>", and then explicitly add a corresponding class to the LabelEncoder afterward:

# train and test are pandas DataFrames and c is whatever column
import numpy as np
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder().fit(train[c])
test[c] = test[c].map(lambda s: '<unknown>' if s not in le.classes_ else s)
le.classes_ = np.append(le.classes_, '<unknown>')
train[c] = le.transform(train[c])
test[c] = le.transform(test[c])

This works, but is there a better solution?


As @sapo_cosmico points out in a comment, it seems that the above doesn't work anymore, given what I assume is an implementation change in LabelEncoder.transform, which now seems to use np.searchsorted (I don't know if it was the case before). So instead of appending the <unknown> class to the LabelEncoder's list of already extracted classes, it needs to be inserted in sorted order:

import bisect

le_classes = le.classes_.tolist()
bisect.insort_left(le_classes, '<unknown>')
le.classes_ = np.array(le_classes)  # keep classes_ a sorted numpy array

However, as this feels pretty clunky all in all, I'm certain there is a better approach for this.
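Putting the pieces together, here is a minimal end-to-end sketch of this sorted-insertion approach (the data, the `c` column name, and the `<unknown>` placeholder are made up for illustration):

```python
import bisect
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

train = pd.DataFrame({'c': ['paris', 'tokyo', 'amsterdam']})
test = pd.DataFrame({'c': ['tokyo', 'oslo']})  # 'oslo' is unseen

le = LabelEncoder().fit(train['c'])

# Insert the placeholder class in sorted order, since transform
# may rely on classes_ being sorted (np.searchsorted).
le_classes = le.classes_.tolist()
bisect.insort_left(le_classes, '<unknown>')
le.classes_ = np.array(le_classes)

# Map anything the encoder has never seen to the placeholder.
test['c'] = test['c'].map(lambda s: s if s in le.classes_ else '<unknown>')

train['c'] = le.transform(train['c'])
test['c'] = le.transform(test['c'])
```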


LabelEncoder is basically a dictionary. You can extract and use it for future encoding:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(train[your_col])  # fit on the training column first

le_dict = dict(zip(le.classes_, le.transform(le.classes_)))

Retrieve the label for a single new item; if the item is missing, fall back to an unknown value:

le_dict.get(new_item, '<Unknown>')

Retrieve labels for a DataFrame column (where unknown_value is whatever sentinel you choose for unseen labels):

df[your_col].apply(lambda x: le_dict.get(x, unknown_value))
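As a self-contained sketch of this dictionary-based approach (the column name, data, and the -1 sentinel are all illustrative):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

train = pd.DataFrame({'city': ['paris', 'tokyo', 'amsterdam', 'tokyo']})
test = pd.DataFrame({'city': ['tokyo', 'oslo']})  # 'oslo' was never seen in training

le = LabelEncoder().fit(train['city'])
# Extract the encoder's mapping as a plain dict: {label: code}
le_dict = dict(zip(le.classes_, le.transform(le.classes_)))

unknown_value = -1  # arbitrary sentinel for unseen labels
encoded = test['city'].apply(lambda x: le_dict.get(x, unknown_value))
```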


I get the impression that what you've done is quite similar to what other people do when faced with this situation.

There has been some effort to add the ability to encode unseen labels to the LabelEncoder (see especially scikit-learn issues #8136 and #13423), but changing the existing behavior is actually more difficult than it seems at first glance.

For now it looks like handling "out-of-vocabulary" labels is left to individual users of scikit-learn.
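One caveat worth noting: for input features (as opposed to the target y), scikit-learn 0.24+ does handle this out of the box via OrdinalEncoder's handle_unknown='use_encoded_value' option, which removes the manual bookkeeping (data below is illustrative):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({'city': ['paris', 'tokyo', 'amsterdam']})
test = pd.DataFrame({'city': ['tokyo', 'oslo']})  # 'oslo' is unseen

# Unseen categories are mapped to unknown_value instead of raising.
enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
enc.fit(train[['city']])
codes = enc.transform(test[['city']])  # 2-D input/output, one column per feature
```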


I have created a class to support this. If a new label comes in, it will be assigned to the unknown class.

from sklearn.preprocessing import LabelEncoder
import numpy as np

class LabelEncoderExt(object):
    def __init__(self):
        """
        Differs from LabelEncoder by handling new classes: anything unseen
        is mapped to an extra 'Unknown' class added during fit.
        """
        self.label_encoder = LabelEncoder()

    def fit(self, data_list):
        """
        Fit the encoder on all the unique values plus the 'Unknown' placeholder.
        :param data_list: a list of strings
        :return: self
        """
        self.label_encoder = self.label_encoder.fit(list(data_list) + ['Unknown'])
        self.classes_ = self.label_encoder.classes_
        return self

    def transform(self, data_list):
        """
        Transform data_list to an id list, assigning unseen values to the 'Unknown' class.
        :param data_list: a list of strings
        :return: encoded ids
        """
        new_data_list = list(data_list)
        for unique_item in np.unique(data_list):
            if unique_item not in self.label_encoder.classes_:
                new_data_list = ['Unknown' if x == unique_item else x for x in new_data_list]
        return self.label_encoder.transform(new_data_list)

The sample usage:

country_list = ['Argentina', 'Australia', 'Canada', 'France', 'Italy', 'Spain', 'US', 'Canada', 'Argentina', 'US']

label_encoder = LabelEncoderExt()
label_encoder.fit(country_list)
print(label_encoder.classes_)  # you can see the new class called Unknown

new_country_list = ['Canada', 'France', 'Italy', 'Spain', 'US', 'India', 'Pakistan', 'South Africa']
print(label_encoder.transform(new_country_list))


I know two devs that are working on building wrappers around transformers and Sklearn pipelines. They have 2 robust encoder transformers (one dummy and one label encoders) that can handle unseen values. Here is the documentation to their skutil library. Search for skutil.preprocessing.OneHotCategoricalEncoder or skutil.preprocessing.SafeLabelEncoder. In their SafeLabelEncoder(), unseen values are auto encoded to 999999.
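I can't vouch for skutil's current API, but the idea behind such a "safe" encoder is easy to sketch; the class name and sentinel below are hypothetical, not skutil's actual implementation:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

class SafeishLabelEncoder:
    """Hypothetical sketch: a LabelEncoder that maps unseen labels to a fixed sentinel."""
    UNSEEN = 99999  # skutil's SafeLabelEncoder reportedly uses 999999

    def fit(self, y):
        self._le = LabelEncoder().fit(y)
        return self

    def transform(self, y):
        seen = set(self._le.classes_)
        y = np.asarray(y, dtype=object)
        # Default everything to the sentinel, then encode only the known labels.
        out = np.full(len(y), self.UNSEEN, dtype=int)
        mask = np.array([v in seen for v in y])
        if mask.any():
            out[mask] = self._le.transform(y[mask])
        return out
```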


  • If I understand the question correctly, wouldn't you save time explicitly adding the '<unknown>' class to your original fit model as a sort of placeholder?
  • I'm not sure I see how it would save time, as I'd still have to perform a mapping from the new, unseen values to '<unknown>' (by testing their inclusion in the list of known classes)? Do you see a way to avoid it?
  • Are you quite sure this works? I tried your solution and <unknown> seems to always be mapping to [0]. However 0 was already mapping to another variable, so it ended up being (silently) wrong.
  • I preferred to move away from the scikit solutions and to go 100% Pandas on this one because of out-of-vocab labels. Answered below with how I've been solving it, based on this answer.
  • This solution no longer works, at least with my code. If I replace '<unknown>' with the actual unseen feature value, le.transform(X) doesn't throw a ValueError (just for debugging/testing this code). Does anybody know if there is a working, better solution to this problem without using pd.get_dummies()?
  • Instead of dummies.columns, do you mean dummy_train.columns?
  • @KevinMarkham kudos to you Sir, caught a bug that had been there for almost a year :)
  • When saving (pickle) the model, do you save dummy_train.columns into its own file?
  • @matthiash generally I'll use it in a pipeline object. I can't say I know enough about pickling, I generally avoid it, but would venture a guess that the state in the pipeline should hold and keep those columns
  • @matthiash in my case, I saved the columns in the same file as the model. Just make sure you write and read in the same order!
  • This answer is pretty concise and effective ;) (upvote)
  • Your solution is for usage by far the simplest one. Thanks!
  • This works perfectly
  • Have they not tried to submit to sklearn itself? This is a universal issue. Obviously we parameterize the default_label_value.