How to encode multiple features at once with SciKit Learn transform


I am trying to encode some categorical features so I can use them in a machine learning model. At the moment I have the following code:

import pandas as pd
from sklearn import preprocessing

data_path = '/Users/novikov/Assignment2/epl-training.csv'
data = pd.read_csv(data_path)
data['Date'] = pd.to_datetime(data['Date'])

le = preprocessing.LabelEncoder()

data['HomeTeam'] = le.fit_transform(data.HomeTeam.values)
data['AwayTeam'] = le.fit_transform(data.AwayTeam.values)
data['FTR'] = le.fit_transform(data.FTR.values)
data['HTR'] = le.fit_transform(data.HTR.values)
data['Referee'] = le.fit_transform(data.Referee.values)

This works, but it is not ideal: if there were 100 features to encode, doing it by hand would take far too long. How do I automate the process? I have tried implementing a loop:

label_encode = ['HomeTeam', 'AwayTeam', 'FTR', 'HTR', 'Referee']

for feature in label_encode:
    method = 'data.' + feature + '.values'
    data[feature] = le.fit_transform(method)

But I get ValueError: bad input shape ():

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-1b8fb6164d2d> in <module>()
     11     method = 'data.' + feature + '.values'
     12     print(method)
---> 13     data[feature] = le.fit_transform(method)

/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/label.py in fit_transform(self, y)
    109         y : array-like of shape [n_samples]
    110         """
--> 111         y = column_or_1d(y, warn=True)
    112         self.classes_, y = np.unique(y, return_inverse=True)
    113         return y

/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py in column_or_1d(y, warn)
    612         return np.ravel(y)
    613 
--> 614     raise ValueError("bad input shape {0}".format(shape))
    615 
    616 

ValueError: bad input shape ()

None of the variations of this code (like just putting data.feature.values) seem to work. There must be a way of doing it other than writing it by hand.

Of course, method = 'data.' + feature + '.values' will not work: it builds a string, so fit_transform is called on that string (a scalar of shape ()), which is exactly what the bad input shape () error is complaining about. Try instead

method = data[feature].values

or

for feature in label_encode:
    data[feature] = le.fit_transform(data[feature].values)
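The loop can also be collapsed into a single DataFrame.apply call. A minimal sketch, with hypothetical sample data standing in for the CSV:

```python
import pandas as pd
from sklearn import preprocessing

# hypothetical sample data standing in for epl-training.csv
data = pd.DataFrame({
    'HomeTeam': ['Arsenal', 'Chelsea', 'Arsenal'],
    'AwayTeam': ['Chelsea', 'Everton', 'Everton'],
    'FTR': ['H', 'A', 'D'],
})

le = preprocessing.LabelEncoder()
label_encode = ['HomeTeam', 'AwayTeam', 'FTR']

# apply hands each column to fit_transform in turn
data[label_encode] = data[label_encode].apply(le.fit_transform)
print(data)  # every listed column is now integer-coded
```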


You can also fix your code as written by adding pd.eval, which evaluates the expression string against the data in scope:

label_encode = ['HomeTeam', 'AwayTeam', 'FTR', 'HTR', 'Referee']

for feature in label_encode:
    method = 'data.' + feature + '.values'
    data[feature] = le.fit_transform(pd.eval(method))


The way the encoder object works is that fit stores metadata (the learned classes_) in the object's attributes, and transform then uses those attributes. fit_transform is a convenience method that fits and transforms in one step.

When you reuse the same object for another fit_transform, you overwrite the stored metadata. That is fine only as long as you never need the object's inverse_transform.
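A minimal sketch of that pitfall (the labels are made up for illustration):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

home = le.fit_transform(['Arsenal', 'Chelsea'])  # classes_ is ['Arsenal', 'Chelsea']
away = le.fit_transform(['Everton', 'Fulham'])   # classes_ overwritten: ['Everton', 'Fulham']

# inverse_transform only knows about the last fit, so the
# home codes now decode to the wrong labels
print(le.inverse_transform(home))  # ['Everton' 'Fulham'], not ['Arsenal' 'Chelsea']
```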

Setup
import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame({
    'HomeTeam': [1, 3, 27],
    'AwayTeam': [9, 8, 100],
    'FTR': ['dog', 'cat', 'dog'],
    'HTR': [*'XYY'],
    'Referee': [*'JJB']
})

Answer to your question

update and apply

le = preprocessing.LabelEncoder()
label_encode = ['HomeTeam', 'AwayTeam', 'FTR', 'HTR', 'Referee']

df.update(df[label_encode].apply(le.fit_transform))
df

   AwayTeam FTR HTR  HomeTeam Referee
0         1   1   0         0       1
1         0   0   1         1       1
2         2   1   1         2       0

How I'd Do It

Each column's encoder is captured in the le dictionary for potential later use:

from collections import defaultdict
le = defaultdict(preprocessing.LabelEncoder)
label_encode = ['HomeTeam', 'AwayTeam', 'FTR', 'HTR', 'Referee']

df = df.assign(**{k: le[k].fit_transform(df[k]) for k in label_encode})
df

   AwayTeam FTR HTR  HomeTeam Referee
0         1   1   0         0       1
1         0   0   1         1       1
2         2   1   1         2       0
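Because every column keeps its own fitted encoder, inverse_transform still round-trips afterwards. A small self-contained sketch of that payoff:

```python
from collections import defaultdict

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'FTR': ['dog', 'cat', 'dog'], 'HTR': [*'XYY']})

le = defaultdict(LabelEncoder)
# one encoder per column, keyed by column name
encoded = df.apply(lambda s: le[s.name].fit_transform(s))

# each encoder is intact, so decoding recovers the original labels
decoded = encoded.apply(lambda s: le[s.name].inverse_transform(s))
print(decoded.equals(df))  # True
```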

pandas.factorize

If you just want integer codes, you can use Pandas' factorize. Note that this does not sort the values; labels are coded in the order they first appear.

df.update(df[label_encode].apply(lambda x: x.factorize()[0]))
df

   AwayTeam FTR HTR  HomeTeam Referee
0         0   0   0         0       0
1         1   1   1         1       0
2         2   0   1         2       1
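A quick sketch of that first-appearance ordering:

```python
import pandas as pd

s = pd.Series(['dog', 'cat', 'dog'])

codes, uniques = s.factorize()
# codes follow first appearance: 'dog' -> 0, 'cat' -> 1
print(codes)          # [0 1 0]
print(list(uniques))  # ['dog', 'cat']
```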

NumPy's unique

This does sort the values, so the result matches LabelEncoder's output:

df.update(df[label_encode].apply(lambda x: np.unique(x, return_inverse=True)[1]))

   AwayTeam FTR HTR  HomeTeam Referee
0         1   1   0         0       1
1         0   0   1         1       1
2         2   1   1         2       0
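The sorted behavior in isolation:

```python
import numpy as np

vals = np.array(['dog', 'cat', 'dog'])
uniques, codes = np.unique(vals, return_inverse=True)

# uniques come back sorted, so 'cat' -> 0 and 'dog' -> 1,
# matching what LabelEncoder would assign
print(list(uniques))  # ['cat', 'dog']
print(codes)          # [1 0 1]
```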


It's a little awkward, but you access the values from each Series and call fit_transform on them, assigning back with X[c] = ... inside the loop so the encoded values land in the DataFrame.

import pandas as pd
from sklearn.preprocessing import LabelEncoder

X = pd.DataFrame({
    'A': [1, 3, 27],
    'B': [9, 8, 100],
    'C': ['dog', 'cat', 'dog']})
print(X.head())

le = LabelEncoder()

for c in X.columns:
    X[c] = le.fit_transform(X[c].values)

X.head()
