Hot questions on using neural networks with categorical data


The Keras CNN example for the MNIST dataset shows how to build a good CNN to recognise handwritten digits. The issue is that it doesn't show how to predict new digits.

For example, given an image, if I do this:


instead of telling me which digit it thinks it is, it gives me a list of 10 numbers (presumably probabilities).


You can use NumPy's argmax to find the class with the maximum probability:

import numpy as np
probabilities = model.predict(image)
classes = np.argmax(probabilities, axis=-1)
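As a side note, a single digit usually needs a batch and a channel dimension before it can be passed to model.predict. A minimal sketch with a stand-in array (not the tutorial's own code):

```python
import numpy as np

# Hypothetical sketch: the Keras MNIST CNN expects a 4-D batch, so a single
# 28x28 grayscale digit must be reshaped to (batch, height, width, channels)
# before calling model.predict.
image = np.random.rand(28, 28).astype("float32")  # stand-in for a real digit
batch = image.reshape(1, 28, 28, 1)
print(batch.shape)  # (1, 28, 28, 1)
```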


I am currently using a neural network that outputs a one hot encoded output.

Upon evaluating it with a classification report I am receiving this error:

UndefinedMetricWarning: Recall and F-score are ill-defined and being set 
to 0.0 in samples with no true labels.

When one-hot encoding my output during the train-test-split phase, I dropped one of the columns to avoid the Dummy Variable Trap. As a result, some of the predictions of my neural network are [0, 0, 0, 0], signaling that the sample belongs to the fifth category. I believe this to be the cause of the UndefinedMetricWarning.

Is there a solution to this? Or should I avoid classification reports in the first place? Is there a better way to evaluate these sorts of neural networks? I'm fairly new to machine learning and neural networks, so please forgive my ignorance. Thank you for all the help!

Edit #1:

Here is my network:

from keras.models import Sequential
from keras.layers import Dense

classifier = Sequential()
classifier.add(Dense(units = 10000,
                     input_shape = (30183,),
                     kernel_initializer = 'glorot_uniform',
                     activation = 'relu'))
classifier.add(Dense(units = 4583,
                     kernel_initializer = 'glorot_uniform',
                     activation = 'relu'))
classifier.add(Dense(units = 1150,
                     kernel_initializer = 'glorot_uniform',
                     activation = 'relu'))
classifier.add(Dense(units = 292,
                     kernel_initializer = 'glorot_uniform',
                     activation = 'relu'))
classifier.add(Dense(units = 77,
                     kernel_initializer = 'glorot_uniform',
                     activation = 'relu'))
classifier.add(Dense(units = 23,
                     kernel_initializer = 'glorot_uniform',
                     activation = 'relu'))
classifier.add(Dense(units = 7,
                     kernel_initializer = 'glorot_uniform',
                     activation = 'relu'))
classifier.add(Dense(units = 4,
                     kernel_initializer = 'glorot_uniform',
                     activation = 'softmax'))

classifier.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['accuracy'])

The above is my network. After training the network, I predict values and convert them to class labels using:

import numpy as np
from sklearn.preprocessing import LabelBinarizer

labels = np.argmax(predictions, axis = -1)
lb = LabelBinarizer()
labeled_predictions = lb.fit_transform(labels)

Upon calling a classification report comparing y_test and labeled_predictions, I receive the error.

As a side note, for anyone curious: I am experimenting with natural language processing and neural networks. The reason the input vector of my network is so large is that it takes count-vectorized text as part of its inputs.

Edit #2:

I converted the predictions into a dataframe and dropped duplicates for both the test set and predictions getting this result:


      javascript  python    r   sql
 738           0       0    0     0
4678           1       0    0     0
6666           0       0    0     1
5089           0       1    0     0
6472           0       0    1     0


     javascript python  r   sql
738           1      0  0     0
6666          0      0  0     1
5089          0      1  0     0
3444          0      0  1     0

So, essentially, because of the way the softmax output is being converted to binary, the predictions can never be [0, 0, 0, 0]. When one-hot encoding y_test, should I just not drop the first column?


Yes, I would say that you should not drop the first column. What you do now is take the softmax output and pick the neuron with the highest value as the label (labels = np.argmax(predictions, axis = -1)), and with that approach you can never get a [0, 0, 0, 0] result vector. So instead, just create a one-hot vector with positions for all 5 classes. Your problem with sklearn should then disappear, as you will get samples with true labels for your 5th class.
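A minimal sketch of this, using sklearn's LabelBinarizer on hypothetical 5-class integer labels so that every class (including the 5th) keeps its own column:

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Hypothetical 5-class integer labels; no column is dropped
labels = np.array([0, 1, 2, 3, 4, 0])
lb = LabelBinarizer()
onehot = lb.fit_transform(labels)
print(onehot.shape)  # (6, 5) -- one column per class
```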

I'm also not sure the dummy variable trap is a problem for neural networks. I have never heard of it in this context, and a short Google Scholar search did not turn up any results. In all the resources I've seen so far about neural networks, I have never seen this problem mentioned. So I guess (and this is really just a guess) that it isn't a problem when training neural networks. This conclusion is also supported by the fact that the majority of NNs use a softmax at the end.


I'm currently working on an artificial intelligence project where the inputs consist of a fixed-length vector of letters, either A, B, C, or D. I'd like to be able to input what letter exists at each position in the vector into the Neural Network. For example, at each position, have an array such that the letter at that position has a 1 in the corresponding input array, while all other positions in the array are 0. For example, if the letter in the tenth position of the letter vector is A, the "input vector" for the input neuron would be something like this:

[A B C D]
[1 0 0 0]

Of course, this could originate from a letter vector like this:

[A B C D D B A A B C A A]

However, input neurons cannot take vectors as inputs. Therefore, what is the best way to format this data for input into a neural network?


I think what you are talking about is called one-hot encoding. If you perform this operation on your example [A B C D] you will get this:

[[1 0 0 0]
 [0 1 0 0]
 [0 0 1 0]
 [0 0 0 1]]

where the first column indicates whether it is an A, the second whether it is a B, and so on.

You can't feed a vector into a single input of the NN, but instead of having only 4 inputs you can flatten the encoded matrix and have 16 inputs.
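A small sketch of that flattening idea in plain NumPy, applied to the 12-letter vector from the question (so 12 letters x 4 classes = 48 inputs, rather than the 16 of the [A B C D] example):

```python
import numpy as np

# One-hot encode each letter, then flatten the matrix into one input vector
letters = list("ABCDDBAABCAA")        # the 12-letter vector from the question
alphabet = ["A", "B", "C", "D"]
onehot = np.array([[1 if ch == a else 0 for a in alphabet] for ch in letters])
flat = onehot.reshape(-1)             # 12 letters x 4 classes = 48 inputs
print(flat.shape)  # (48,)
```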


I have created an Artificial Neural Network with 4 categorical features and a binary outcome either 1 for suspicious or 0 for non-suspicious:

  ParentPath                                  ParentExe
0   C:\Program Files (x86)\Wireless AutoSwitch  wrlssw.exe
1   C:\Program Files (x86)\Wireless AutoSwitch  WrlsAutoSW.exs
2   C:\Program Files (x86)\Wireless AutoSwitch  WrlsAutoSW.exs
3   C:\Windows\System32                         svchost.exe
4   C:\Program Files (x86)\Wireless AutoSwitch  WrlsAutoSW.exs

ChildPath                                   ChildExe    Suspicious
C:\Windows\System32                         conhost.exe  0
C:\Program Files (x86)\Wireless AutoSwitch  wrlssw.exe   0 
C:\Program Files (x86)\Wireless AutoSwitch  wrlssw.exe   0
C:\Program Files\Common Files               OfficeC2RClient.exe  0
C:\Program Files (x86)\Wireless AutoSwitch  wrlssw.exe  1
C:\Program Files (x86)\Wireless AutoSwitch  wrlssw.exe  0

I have used sklearn for label encoding and one hot encoding on the data:

#Import the dataset
X = DBF2.iloc[:, 0:4].values
#X = DBF2[['ParentProcess', 'ChildProcess']]
y = DBF2.iloc[:, 4].values#.ravel()

#Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
#Label Encode Parent Path
labelencoder_X_1 = LabelEncoder()
X[:, 0] = labelencoder_X_1.fit_transform(X[:, 0])
#Label Encode Parent Exe
labelencoder_X_2 = LabelEncoder()
X[:, 1] = labelencoder_X_2.fit_transform(X[:, 1])
#Label Encode Child Path
labelencoder_X_3 = LabelEncoder()
X[:, 2] = labelencoder_X_3.fit_transform(X[:, 2])
#Label Encode Child Exe
labelencoder_X_4 = LabelEncoder()
X[:, 3] = labelencoder_X_4.fit_transform(X[:, 3])

#Create dummy variables
onehotencoder = OneHotEncoder(categorical_features = [0,1,2,3])
X = onehotencoder.fit_transform(X)

I have split the data into a training and test set and run it on my gpu box with a nvidia 1080. I have tuned the hyperparameters and am now ready to use the model that is trained in a production environment with one test sample being tested at a time. Lets say I just want to test one sample:

   ParentPath            ParentExe     ChildPath           ChildExe
0  C:\Windows\Malicious  badscipt.exe  C:\Windows\System   cmd.exe  

The issue I am running into is that the training set has seen the ChildPath "C:\Windows\System" and the ChildExe "cmd.exe", which are normal, but it has not seen the ParentPath "C:\Windows\Malicious" or the ParentExe "badscipt.exe", so these have not been label- or one-hot encoded. My big question is: how do I handle a test sample when part of it was not seen during training?

I have seen examples using feature hashing, but I'm not sure how to apply that or whether it would even solve this problem. Any help or pointers would be greatly appreciated.
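One common approach, sketched here with hypothetical path/exe values (not the question's actual data), is OneHotEncoder's handle_unknown='ignore' option, which encodes categories unseen during training as all-zero vectors instead of raising an error:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training values; handle_unknown='ignore' makes unseen
# categories encode to all zeros instead of raising an error.
train = np.array([["C:\\Windows\\System32", "svchost.exe"],
                  ["C:\\Program Files", "conhost.exe"]])
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)

test = np.array([["C:\\Windows\\Evil", "badscript.exe"]])  # unseen values
encoded = enc.transform(test).toarray()
print(encoded)  # all zeros: neither category was seen in training
```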


#Create data frame with malicious test sample
testmalicious = {'ParentProcess': ['C:\Windows\System32\services.exe'], 'ChildProcess': ['C:\Windows\System32\svch0st.exe'], 'Suspicous': [1]}
testmaliciousdf = pd.DataFrame(data=testmalicious)
testmaliciousdf = testmaliciousdf[['ParentProcess', 'ChildProcess', 'Suspicous']]
#Add the malicious to the end of dataframe
DBF1 = DBF2.append(testmaliciousdf)
DBF2 = DBF1.reset_index(drop=True)
#Location where mal_array sample is located - after label and one hot encoded pull out of training set
mal_array = X[368827:368828]
#Remove the last line of the array from training set
#Remove the last line of the array from the y data
#At the end test if suspicious or not
new_prediction = classifier.predict(sc.transform(mal_array))
new_prediction = (new_prediction > 0.5)
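The commented removal steps above could be sketched like this, with stand-in arrays (the real X and y come from the encoding code earlier):

```python
import numpy as np

# Stand-ins for the encoded feature matrix and labels; the appended
# malicious sample is the last row of each.
X = np.random.rand(10, 4)
y = np.random.randint(0, 2, size=10)
mal_array = X[-1:]                 # pull the appended sample out
X_train, y_train = X[:-1], y[:-1]  # remove it from the training data
print(X_train.shape, mal_array.shape)  # (9, 4) (1, 4)
```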



Currently creating a neural network, and I need to have the data structured properly. One of the data columns contains string data that needs to be converted to numeric. The only problem is that the string data in each row looks like QWERTGCD, AWERTKRD, TWERTKRR, etc. There are over 1000 rows, each with the same or different strings like in the example posted. I don't know how to convert that many strings into categorical data at this scale. The same goes for the labels portion.

So far I have this to start with

dataset$Box = as.numeric(factor(dataset$Box, levels = c(), labels = c()))

Not sure if I am overthinking this, but I can't figure out how to fill in the levels and labels without painstakingly going through the data and inputting them myself.

Here's an example of the data that's being worked with:

B,11979,13236,1261,3,QWERTGCD,1
B,475514,476069,559,33,QWERTOOD,1
C,65534,65867,337,1,QWERAEER,1
C,73738,74657,923,2,AWERTWED,1



Without a reproducible example, it's hard to know exactly what you need, but in general, one thing R is good at is running operations on entire columns at once. You're just converting the column in dataset named Box from a string to numeric, going through a factor. factor() finds all the unique values in your column for you, so you don't need to specify them.

dataset$Box <- as.numeric(factor(dataset$Box))

will take the Box column in dataset and convert it from class character to class numeric, numbering the character values in Box in alphanumeric order (unless you specify otherwise). It may even already be a factor, depending on how your dataset was generated. You can check with class(dataset$Box). If that returns factor, then you just need to run dataset$Box <- as.numeric(dataset$Box).