Hot questions for Using Neural networks in one hot encoding

Question:

I know that categorical data should be one-hot encoded before training a machine learning algorithm. I also know that for multivariate linear regression I need to exclude one of the encoded variables to avoid the so-called dummy variable trap.

Ex: If I have a categorical feature "size" with values "small", "medium", "large", then after one-hot encoding I would have something like:

small  medium  large other-feature
  0      1       0      2999

So to avoid the dummy variable trap I need to remove any one of the 3 columns, for example, column "small".

Should I do the same when training a neural network? Or is this purely for multivariate regression?

Thanks.


Answer:

As stated here, the dummy variable trap needs to be avoided (one category of each categorical feature removed after encoding but before training) on the input of algorithms that consider all the predictors together, as a linear combination. Such algorithms are:

  • Linear/multilinear regression
  • Logistic regression
  • Discriminant analysis
  • Neural networks that don't employ weight decay

If you remove a category from the input of a neural network that employs weight decay, it will get biased in favor of the omitted category instead.

Even though no information is lost when omitting one category after encoding a feature, other algorithms will have to infer the correlation of the omitted category indirectly through the combination of all the other categories, making them do more computation for the same result.
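
To make the linear-dependence argument concrete, here is a minimal NumPy sketch. The toy "size" data is hypothetical and only illustrates the rank problem: with an intercept column plus all three dummies, the dummy columns sum to the intercept, so the design matrix loses one rank.

```python
import numpy as np

# Hypothetical toy data: the "size" feature from the question, one entry per sample.
sizes = np.array(["small", "medium", "large", "medium", "small", "large"])
categories = np.array(["small", "medium", "large"])

# One-hot encode: one column per category.
one_hot = (sizes[:, None] == categories[None, :]).astype(float)

intercept = np.ones((len(sizes), 1))

# Intercept + all three dummies: the dummy columns sum to the intercept column,
# so the matrix has 4 columns but only rank 3.
full = np.hstack([intercept, one_hot])
rank_full = np.linalg.matrix_rank(full)

# Intercept + two dummies ("small" dropped): full column rank.
reduced = np.hstack([intercept, one_hot[:, 1:]])
rank_reduced = np.linalg.matrix_rank(reduced)

print(full.shape[1], rank_full)        # 4 columns, rank 3
print(reduced.shape[1], rank_reduced)  # 3 columns, rank 3
```

A rank-deficient design matrix means the least-squares coefficients are not unique, which is the trap for plain linear models. A penalty such as weight decay picks one solution among the equivalent ones, which is why dropping a column matters less there.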

Question:

I'm facing a problem that looks like a challenging task for me. I have a huge dataset of DNA with an A, G, T, C structure, i.e. 4 totally different categories as input. It looks like:

1 2 3 4 5 6 7 8 9 … 1.000+
A A G G G G G G G
G G C C C C C C C
T T C C C C C C C
G G A A A A A A A
T T C C C C C C C
C C T T T T T T T
T T C C C C C C C
…
30.000+

I would like to ask for advice about data processing. Should the data be represented numerically, or one-hot encoded despite such huge dimensionality? That is, something like [0,0,0,1] for A, [0,0,1,0] for G, etc., or just 0, 1, 2, 3? As for the NN, I would like to start with a simple one and move on to more modern and deeper ones. A typical numerical representation is easily done with the pandas and sklearn libraries in a few lines of code, but converting such a huge dataset to one-hot encoding looks like an interesting task. Using pd.get_dummies on a (1019, 27041) shape we obtain (1019, 54082), and I cannot understand why the shape increased only 2 times when we have 4 different letters. Thank you!


Answer:

The 2x increase instead of a 4x increase is because you only have 2 categories in each of the series. (In your example, A and G in the first row, G and C in the second, T and C in the third, and so on.)

The example below gives a better understanding of the number of additional columns:

In [38]: df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'], 'C': [1, 2, 3]})                                                                       

In [39]: df                                                                                                                                                    
Out[39]: 
   A  B  C
0  a  b  1
1  b  a  2
2  a  c  3

In [40]: pd.get_dummies(df)                                                                                                                                    
Out[40]: 
   C  A_a  A_b  B_a  B_b  B_c
0  1    1    0    0    1    0
1  2    0    1    1    0    0
2  3    1    0    0    0    1
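
If a fixed 4-column encoding per position is wanted regardless of which letters actually occur, one option is to declare the alphabet up front as a categorical dtype. The following is a sketch under that assumption; the tiny frame and the column names `pos1`/`pos2` are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame: 3 sequences (rows) x 2 positions (columns).
df = pd.DataFrame({"pos1": ["A", "G", "A"], "pos2": ["C", "C", "T"]})

# get_dummies only creates columns for letters that actually occur in a column:
# 2 letters in each of the 2 columns -> 4 dummy columns (a 2x increase).
observed = pd.get_dummies(df)

# Declaring the full alphabet forces 4 dummy columns per position,
# with all-zero columns for letters that never occur at that position.
alphabet = pd.CategoricalDtype(categories=["A", "C", "G", "T"])
fixed = pd.get_dummies(df.astype(alphabet))

print(observed.shape)  # (3, 4)
print(fixed.shape)     # (3, 8)
```

Whether the extra all-zero columns are worth the memory depends on the model; they mainly help when training and test data must share exactly the same column layout.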


Question:

I am building a neural network and am at the point of using OneHotEncoder on many independent (categorical) variables. I would like to know if I am approaching this properly with dummy variables, or whether, since all of my variables require dummy variables, there may be a better way.

df  
    UserName    Token                       ThreadID    ChildEXE       
0   TAG     TokenElevationTypeDefault (1)   20788       splunk-MonitorNoHandle.exe  
1   TAG     TokenElevationTypeDefault (1)   19088       splunk-optimize.exe 
2   TAG     TokenElevationTypeDefault (1)   2840        net.exe 
807 User    TokenElevationTypeFull (2)      18740       E2CheckFileSync.exe 
808 User    TokenElevationTypeFull (2)      18740       E2check.exe 
809 User    TokenElevationTypeFull (2)      18740       E2check.exe 
811 Local   TokenElevationTypeFull (2)      18740       sc.exe  

ParentEXE           ChildFilePath               ParentFilePath   
splunkd.exe         C:\Program Files\Splunk\bin C:\Program Files\Splunk\bin 0
splunkd.exe         C:\Program Files\Splunk\bin C:\Program Files\Splunk\bin 0
dagent.exe          C:\Windows\System32         C:\Program Files\Dagent 0
wscript.exe         \Device\Mup\sysvol          C:\Windows  1
E2CheckFileSync.exe C:\Util                     \Device\Mup\sysvol\ 1
cmd.exe             C:\Windows\SysWOW64         C:\Util\E2Check 1
cmd.exe             C:\Windows                  C:\Windows\SysWOW64 1

DependentVariable
0
0
0
1
1
1
1

I import the data and use LabelEncoder on the independent variables:

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

#IMPORT DATA
#Matrix x of features
X = df.iloc[:, 0:7].values
#Dependent variable
y = df.iloc[:, 7].values

#Encoding Independent Variable
#Need a label encoder for every categorical variable
#Converts categorical into number - set correct index of column
#Encode "UserName"
labelencoder_X_1 = LabelEncoder()
X[:, 0] = labelencoder_X_1.fit_transform(X[:, 0])
#Encode "Token"
labelencoder_X_2 = LabelEncoder()
X[:, 1] = labelencoder_X_2.fit_transform(X[:, 1])
#Encode "ChildEXE"
labelencoder_X_3 = LabelEncoder()
X[:, 3] = labelencoder_X_3.fit_transform(X[:, 3])
#Encode "ParentEXE"
labelencoder_X_4 = LabelEncoder()
X[:, 4] = labelencoder_X_4.fit_transform(X[:, 4])
#Encode "ChildFilePath"
labelencoder_X_5 = LabelEncoder()
X[:, 5] = labelencoder_X_5.fit_transform(X[:, 5])
#Encode "ParentFilePath"
labelencoder_X_6 = LabelEncoder()
X[:, 6] = labelencoder_X_6.fit_transform(X[:, 6])

This gives me the following array:

X
array([[2, 0, 20788, ..., 46, 31, 24],
       [2, 0, 19088, ..., 46, 31, 24],
       [2, 0, 2840, ..., 27, 42, 15],
       ...,
       [2, 0, 20148, ..., 17, 40, 32],
       [2, 0, 20148, ..., 47, 23, 0],
       [2, 0, 3176, ..., 48, 42, 32]], dtype=object)

Now for all of the independent variables I have to create dummy variables:

Should I use:

onehotencoder = OneHotEncoder(categorical_features = [0, 1, 2, 3, 4, 5, 6])
X = onehotencoder.fit_transform(X).toarray() 

Which gives me:

X
array([[0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 1., ..., 1., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 1., ..., 1., 0., 0.]])

Or is there a better way to approach this?


Answer:

You can also try pd.get_dummies, which expects a DataFrame when the columns argument is used: X = pd.get_dummies(pd.DataFrame(X), columns=[0, 1, 2, 3, 4, 5, 6], drop_first=True)

'drop_first=True' saves you from the dummy variable trap.
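
As a rough illustration of what drop_first does, here is a sketch on a hypothetical miniature of the question's frame (two of the categorical columns, made-up rows). Note that pd.get_dummies works on string columns directly, so the LabelEncoder step is not needed for this route.

```python
import pandas as pd

# Hypothetical miniature of the question's frame (two categorical columns only).
df = pd.DataFrame({
    "UserName": ["TAG", "User", "Local", "User"],
    "ParentEXE": ["splunkd.exe", "cmd.exe", "cmd.exe", "wscript.exe"],
})

# Full one-hot: one column per category (3 + 3 = 6 columns).
full = pd.get_dummies(df)

# drop_first=True drops the first category of each feature (2 + 2 = 4 columns).
reduced = pd.get_dummies(df, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

The dropped category becomes the implicit baseline: a row with zeros in all remaining dummies of a feature belongs to it.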

Question:

I'm training a text classification model where the input data consists of 4096 term frequency–inverse document frequencies.

My output consists of 416 possible categories. Each piece of data has 3 categories, so each target is an array with 3 ones and 413 zeros (one-hot encodings).

My model looks like this:

model = Sequential()
model.add(Dense(2048, activation="relu", input_dim=X.shape[1]))
model.add(Dense(512, activation="relu"))
model.add(Dense(416, activation="sigmoid"))

When I train it with the binary_crossentropy loss, it has a loss of 0.185 and an accuracy of 96% after one epoch. After 5 epochs, the loss is at 0.037 and the accuracy at 99.3%. I guess this is wrong, since there are a lot of 0s in my labels, which it classifies correctly.

When I train it with the categorical_crossentropy loss, it has a loss of 15.0 and an accuracy of below 5% in the first few epochs, before it gets stuck at a loss of 5.0 and an accuracy of 12% after several (over 50) epochs.

Which one of those would be right for my situation (large one-hot-encodings with multiple 1s)? What do these scores tell me?

EDIT: These are the model.compile() statements:

model.compile(loss='categorical_crossentropy',
              optimizer=keras.optimizers.Adam(),
              metrics=['accuracy'])

and

model.compile(loss='binary_crossentropy',
              optimizer=keras.optimizers.Adam(),
              metrics=['accuracy'])

Answer:

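In short: the (high) accuracy reported when you use loss='binary_crossentropy' is not the correct one, as you have already guessed. For single-label problems (exactly one 1 per target), the recommended loss is categorical_crossentropy; with several 1s per target, as in your data, binary_crossentropy with sigmoid outputs remains a valid loss, and it is the reported accuracy that misleads.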


In long:

The underlying reason for this behavior is a rather subtle and undocumented issue in how Keras actually guesses which accuracy to use, depending on the loss function you have selected, when you simply include metrics=['accuracy'] in your model compilation, as you have. In other words, while your first compilation option

model.compile(loss='categorical_crossentropy',
          optimizer=keras.optimizers.Adam(),
          metrics=['accuracy'])

is valid, your second one:

model.compile(loss='binary_crossentropy',
          optimizer=keras.optimizers.Adam(),
          metrics=['accuracy'])

will not produce what you expect, but the reason is not the use of binary cross entropy (which, at least in principle, is an absolutely valid loss function).

Why is that? If you check the metrics source code, Keras does not define a single accuracy metric, but several different ones, among them binary_accuracy and categorical_accuracy. What happens under the hood is that, since you have selected loss='binary_crossentropy' and have not specified a particular accuracy metric, Keras (wrongly...) infers that you are interested in the binary_accuracy, and this is what it returns - while in fact you are interested in the categorical_accuracy.

Let's verify that this is the case, using the MNIST CNN example in Keras, with the following modification:

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])  # WRONG way

model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=2,  # only 2 epochs, for demonstration purposes
          verbose=1,
          validation_data=(x_test, y_test))

# Keras reported accuracy:
score = model.evaluate(x_test, y_test, verbose=0) 
score[1]
# 0.9975801164627075

# Actual accuracy calculated manually:
import numpy as np
y_pred = model.predict(x_test)
acc = sum([np.argmax(y_test[i])==np.argmax(y_pred[i]) for i in range(10000)])/10000
acc
# 0.98780000000000001

score[1]==acc
# False    

Arguably, the verification of the above behavior with your own data should be straightforward.
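
For targets shaped like those in the question (3 ones among 416 entries), a quick check is possible without training anything. The sketch below is a plain NumPy re-implementation of element-wise accuracy with a 0.5 threshold, not Keras's own code; the random targets and the all-zero "model" are hypothetical.

```python
import numpy as np

n_samples, n_classes, n_active = 1000, 416, 3
rng = np.random.default_rng(0)

# Hypothetical targets: each row has exactly 3 ones among 416 entries.
y_true = np.zeros((n_samples, n_classes))
for row in y_true:
    row[rng.choice(n_classes, size=n_active, replace=False)] = 1.0

# A useless "model" that predicts 0 for every label.
y_pred = np.zeros_like(y_true)

# Element-wise (binary) accuracy: fraction of all entries that match
# after thresholding at 0.5.
binary_acc = np.mean((y_pred > 0.5) == (y_true > 0.5))

# Fraction of samples whose full label vector is predicted exactly.
exact_match = np.mean(np.all((y_pred > 0.5) == (y_true > 0.5), axis=1))

print(round(binary_acc, 4))  # 0.9928, i.e. 413/416
print(exact_match)           # 0.0
```

An all-zero prediction already scores 413/416, about 99.3%, on the element-wise metric, so a high value of that metric says little about whether the three active labels are found.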

And just for completeness, if, for whatever reason, you insist on using binary cross entropy as your loss function (as I said, nothing wrong with this, at least in principle) while still getting the categorical accuracy required by the problem at hand, you should ask explicitly for categorical_accuracy in the model compilation as follows:

from keras.metrics import categorical_accuracy
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=[categorical_accuracy])

In the MNIST example, after training, scoring, and predicting the test set as I show above, the two metrics now are the same, as they should be:

# Keras reported accuracy:
score = model.evaluate(x_test, y_test, verbose=0) 
score[1]
# 0.98580000000000001

# Actual accuracy calculated manually:
y_pred = model.predict(x_test)
acc = sum([np.argmax(y_test[i])==np.argmax(y_pred[i]) for i in range(10000)])/10000
acc
# 0.98580000000000001

score[1]==acc
# True    

System setup:

Python version 3.5.3
Tensorflow version 1.2.1
Keras version 2.0.4

Question:

I am a beginner learning machine learning.

I am trying to build a model (FNN), and this model has too many output labels to use one-hot encoding.

Could you help me?

I want to solve this problem: the label data is for fruits:

Type (Apple, Grapes, Peach), Quality (Good, Normal, Bad), Price (Expensive, Normal, Cheap), Size (Big, Normal, Small)

So, if I use one-hot encoding, the label size grows to 3*3*3*3 = 81.

I think the label data looks like a sequence of 4 one-hot encodings.

Is there any way to represent the labels in a small dimension instead of an 81-dimensional one-hot encoding?

I think binary encoding could also be used, but I am aware of some shortcomings of using binary encoding in a NN.

Thanks :D


Answer:

If you one hot encode your 4 variables you will have 3+3+3+3=12 variables, not 81.

The concept is that you need to create a binary variable for every category in a categorical feature, not one for every possible combination of categories in the four features.
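
A small pandas sketch of that count, using three hypothetical samples chosen so that every category of each feature appears once:

```python
import pandas as pd

# Hypothetical samples covering every category of the four label features.
labels = pd.DataFrame({
    "Type": ["Apple", "Grapes", "Peach"],
    "Quality": ["Good", "Normal", "Bad"],
    "Price": ["Expensive", "Normal", "Cheap"],
    "Size": ["Big", "Normal", "Small"],
})

# One binary column per category of each feature.
encoded = pd.get_dummies(labels)

print(encoded.shape[1])  # 12 = 3 + 3 + 3 + 3
```

Each encoded row then contains exactly four ones, one per feature, rather than a single one in an 81-dimensional vector.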

Nevertheless, other possible approaches are Numerical Encoding, Binary Encoding (as you mentioned), or Frequency Encoding (replace every category with its frequency in the dataset). The results often depend on the problem, so try different approaches and see what best fits yours!