Hot questions for Using Neural networks in text classification



I'm writing a program to classify texts into a few classes. Right now, the program loads the train and test samples of word indices, applies an embedding layer and a convolutional layer, and classifies them into the classes. I'm trying to add handcrafted features for experimentation, as in the following code. The features variable is a list of two elements: the first holds the features for the training data, and the second holds the features for the test data. Each training/test sample has a corresponding feature vector (i.e. the features are per sample, not per word).

model = Sequential()

# Adding hand-picked features
model_features = Sequential()
nb_features = len(features[0][0])


model_final = Sequential()
model_final.add(Merge([model, model_features], mode='concat'))

model_final.add(Dense(len(citfunc.funcs), activation='softmax'))

print model_final.summary()
model_final.fit([x_train, features[0]], y_train,
                class_weight=data.get_class_weights(x_train, y_train))

y_pred = model_final.predict([x_test, features[1]])

My question is, is this code correct? Is there any conventional way of adding features to each of the text sequences?



input = Input(shape=(params.maxlen,))
embedding = Embedding(params.nb_words,
conv = Convolution1D(nb_filter=params.nb_filter,
drop = Dropout(params.dropout_rate)(conv)
seq_features = GlobalMaxPooling1D()(drop)

# Adding hand-picked features
nb_features = len(features[0][0])
other_features = Input(shape=(nb_features,))

model_final = merge([seq_features, other_features], mode='concat')

model_final = Dense(len(citfunc.funcs), activation='softmax')(model_final)

model_final = Model([input, other_features], model_final)


In this case, you merge the features from the sequence analysis directly with the custom features, without first squashing all the custom features into a single feature using Dense.
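To make the shapes concrete, here is a minimal NumPy sketch of what the concatenation followed by the softmax layer computes. It is not Keras code: the batch size, nb_filter, nb_features, nb_classes and the random weights are all made-up values for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

batch_size, nb_filter, nb_features, nb_classes = 4, 8, 3, 5  # hypothetical sizes

# Stand-ins for the two branches: pooled convolution output and handcrafted features.
seq_features = rng.normal(size=(batch_size, nb_filter))
other_features = rng.normal(size=(batch_size, nb_features))

# Concatenation joins the two vectors of each sample along the feature axis.
merged = np.concatenate([seq_features, other_features], axis=1)

# A dense softmax layer on top: one weight row per merged feature.
W = rng.normal(size=(nb_filter + nb_features, nb_classes))
b = np.zeros(nb_classes)
logits = merged @ W + b
logits -= logits.max(axis=1, keepdims=True)  # for numerical stability
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

print(merged.shape)  # (4, 11)
print(probs.shape)   # (4, 5)
```

Because the handcrafted features enter the final layer next to the pooled convolution features, it may help to scale them to a comparable range first.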


My question is: why should my training set also be skewed (far fewer instances of the positive class than of the negative class) when my test set is skewed? I read that it is important to keep the class distribution the same in both the training and test sets to get the most realistic performance. For example, if my test set has a 90%-10% distribution of class instances, should my training set also have the same proportions?

I am finding it difficult to understand why it is important to maintain, in the training set, the same proportions of class instances as in the test set.

The reason I find it difficult to understand is this: don't we want a classifier to just learn the patterns in both classes? So, should it matter to maintain skewness in the training set just because the test set is skewed?

Any thoughts will be helpful.


IIUC, you're asking about the rationale for using stratified sampling (e.g., as used in scikit-learn's StratifiedKFold).

Once you've divided your data into train and test sets, you have three datasets to consider:

  1. the "real world" set, on which your classifier will really run
  2. the train set, on which you'll learn patterns
  3. the test set, which you'll use to evaluate the performance of the classifier

(So the uses of 2. + 3. are really just for estimating how things will run on 1, including possibly tuning parameters.)

Suppose some class in your data is far from uniformly represented - say it appears only 5% of the time. Moreover, you believe that this is not a GIGO case: in the real world, the probability of this class really is about 5%.

When you divide into 2. + 3., you run the chance that things will be skewed relative to 1.:

  • It's very possible that the class won't appear 5% of the time (in the train or test set), but rather more or less often.

  • It's very possible that some of the feature instances of the class will be skewed in the train or test set, relative to 1.

In these cases, when you make decisions based on the 2. + 3. combination, the result probably won't be a good indication of the effect on 1., which is what you're really after.

Incidentally, I don't think the emphasis is on skewing the train to fit the test, but rather on making the train and test each fit the entire sampled data.
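As an illustration of the idea, here is a small pure-Python sketch of a stratified split. The function name stratified_split and the 95%/5% toy labels are assumptions made for this example; in practice a library routine such as scikit-learn's StratifiedKFold would normally be used instead.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with per-class proportions kept roughly equal."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label, indices in by_class.items():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical data: 95% negative (0), 5% positive (1).
labels = [0] * 950 + [1] * 50
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(train_counts[1] / len(train_idx))  # 0.05
print(test_counts[1] / len(test_idx))    # 0.05
```

Both splits keep the 5% share of the rare class, so decisions made on the train/test pair are less likely to be distorted by an unlucky random split.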


I'm currently using a Naive Bayes algorithm to do my text classification.

My end goal is to be able to highlight parts of a big text document if the algorithm has decided that a sentence belongs to a category.

Naive Bayes results are good, but I would like to train a NN for this problem, so I've followed a tutorial to build my LSTM network in Keras.

All these notions are quite difficult for me to understand right now, so excuse me if you see some really stupid things in my code.

1/ Preparation of the training data

I have 155 sentences of different sizes that have each been tagged with a label.

All these tagged sentences are in a training.csv file:


(each integer representing a word)

And all the results are in another label.csv file:

6,7,17,15,16,18,4,27,30,30,29,14,16,20,21 ...

I have 155 lines in training.csv, and of course 155 integers in label.csv.

My dictionary has 1038 words.

2/ The code

Here is my current code:

total_words = 1039

## fix random seed for reproducibility

datafile = open('training.csv', 'r')
datareader = csv.reader(datafile)
data = []
for row in datareader:
    data.append(row)  # loop body missing from the source; assumed to collect each row

X = data
Y = numpy.genfromtxt("labels.csv", dtype="int", delimiter=",")

max_sentence_length = 500

X_train = sequence.pad_sequences(X, maxlen=max_sentence_length)
X_test = sequence.pad_sequences(X, maxlen=max_sentence_length)

# create the model
embedding_vecor_length = 32
model = Sequential()
model.add(Embedding(total_words, embedding_vecor_length, input_length=max_sentence_length))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(X_train, Y, epochs=3, batch_size=64)
# Final evaluation of the model
scores = model.evaluate(X_train, Y, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

This model is never converging:

155/155 [==============================] - 4s - loss: 0.5694 - acc: 0.0000e+00     
Epoch 2/3
155/155 [==============================] - 3s - loss: -0.2561 - acc: 0.0000e+00     
Epoch 3/3
155/155 [==============================] - 3s - loss: -1.7268 - acc: 0.0000e+00  

I would like to have one of the 24 labels as a result, or a list of probabilities for each label.

What am I doing wrong here?

Thanks for your help!


I've updated my code thanks to the great comments posted to my question.

Y_train = numpy.genfromtxt("labels.csv", dtype="int", delimiter=",")
Y_test = numpy.genfromtxt("labels_test.csv", dtype="int", delimiter=",")
Y_train =  np_utils.to_categorical(Y_train)
Y_test = np_utils.to_categorical(Y_test)
max_review_length = 50

X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)

model = Sequential()
model.add(Embedding(top_words, 32, input_length=max_review_length))
model.add(LSTM(10, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(31, activation="softmax"))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=["accuracy"])
model.fit(X_train, Y_train, epochs=100, batch_size=30)
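A likely explanation for the negative losses in the first version is that binary cross-entropy assumes targets of 0 or 1, while the labels there are integer class ids such as 6 or 17. The NumPy sketch below (helper names are made up for this example) shows the loss going negative for such a target, and what one-hot encoding in the style of to_categorical produces.

```python
import numpy as np

def binary_crossentropy(y_true, p):
    """Binary cross-entropy for one prediction p in (0, 1)."""
    return -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def one_hot(labels, num_classes):
    """One-hot encode integer class labels."""
    encoded = np.zeros((len(labels), num_classes))
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

# With a proper 0/1 target the loss is non-negative...
loss_valid = binary_crossentropy(1, 0.9)
# ...but with a class id such as 6 used as the target it goes negative.
loss_invalid = binary_crossentropy(6, 0.9)
print(loss_valid, loss_invalid)

labels = np.array([6, 7, 17])   # first three labels from the question
Y = one_hot(labels, num_classes=31)
print(Y.shape)                  # (3, 31)
```

With one-hot targets and a softmax output, each row of the prediction can be read as a list of probabilities, one per label.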

I think I can play with LSTM size (10 or 100), number of epochs and batch size.

The model has very poor accuracy (40%), but I currently think that's because I don't have enough data (150 sentences for 24 labels).

I will put this project on hold until I get more data.

If someone has some ideas to improve this code, feel free to comment!


When I started with neural networks, it seemed I understood optimizers and estimators well.

Estimators: a Classifier classifies a value based on a sample set, and a Regressor predicts a value based on a sample set.

Optimizer: different optimizers (Adam, GradientDescentOptimizer) are used to minimise the loss function, which could be complex.

I understand that every estimator comes with a default optimizer that it uses internally to minimise the loss.

Now my question is: how do they fit together to optimize the training?


Short answer: the loss function links them together.

For example, if you are doing classification, your classifier takes an input and outputs a prediction. You can then calculate the loss from the predicted class and the ground-truth class. The task of the optimizer is to minimize that loss by modifying the parameters of the classifier.
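The interplay can be sketched with a toy example in NumPy. A one-parameter linear model stands in for the estimator, mean squared error for the loss, and plain gradient descent for the optimizer; the data and learning rate are made up for illustration.

```python
import numpy as np

# Hypothetical toy data: y = 2x, so the ideal weight is 2.0.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x

w = 0.0                 # the model's only parameter
learning_rate = 0.01

def predict(x, w):
    return w * x        # the "estimator"

def loss(y_pred, y_true):
    return np.mean((y_pred - y_true) ** 2)

for step in range(200):
    y_pred = predict(x, w)
    grad = np.mean(2 * (y_pred - y) * x)   # derivative of the loss with respect to w
    w -= learning_rate * grad              # the "optimizer" step

print(w, loss(predict(x, w), y))
```

The estimator never sees the optimizer directly: it only produces predictions, the loss scores them, and the optimizer uses the gradient of that score to update the parameter.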


I want to do text classification using a neural network in Keras. I have set up a simple test sample using the following network:

model = Sequential()
model.add(Embedding(NUMVOCABOLARYWORDS, 5, input_length = sequenceDataPadded.shape[1]))
model.add(LSTM(256, dropout=0.2, recurrent_dropout=0.2))

This network accepts tokenized, padded sequences of text. E.g. I tokenize the text "hello world" = [0,1,0,0,0..]. It trains and evaluates fine.

Now my issue is that I do not want to feed a single sequence into the network, but rather a collection of (let's say 500) sequences, and get one category out. So instead of an input with shape (100) it's now (500, 100). I'm unsure how best to design the network architecture, i.e.:

1) Should I flatten the input or try to reduce the dimensions? What layers could I use for that job?

2) Should I just create one large sequence with all the text?

3) Does it even make sense to have a LSTM with 4 dimensions?

4) Do examples exist for classification with an array of arrays of tokens?

The text is collected from different sources, so the different sequences in each batch are not necessarily related to each other by anything other than date.


I don't think that merging all the text together is the solution. The problem is that if you feed the merged text to the LSTM, the hidden state is not reset at the start of each text. You feed in the first text, and then the second and all later texts start from whatever hidden state the previous text left behind.

You could use the functional API, create separate inputs, and give each input its own LSTM. Then you can merge them and put the dense layers at the end. Another thing you could try is a CNN. Again, you'd either have to create multiple inputs or concatenate all the inputs and then use CNN layers. The advantage here could be speed: depending on how many LSTMs you have and how big your input is, training can take quite a while, especially because backpropagation also has to go through every timestep. So performance-wise you may be better off with CNNs.

So what I would do is keep the arrays separate, with a max length. Then you pad every array to this length (if it is too short). Then you create multiple inputs with the functional API and put Conv1D layers behind them. You do some conv operations (maybe stack a few conv layers, max pooling, etc.). Then you merge them with the concatenate layer. And then you add some more dense or CNN layers.
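The padding step described above can be sketched in NumPy as follows. The helper pad_collection and the sample token ids are hypothetical; note that this sketch pads at the end of each sequence, whereas Keras's pad_sequences pads at the front by default.

```python
import numpy as np

def pad_collection(sequences, maxlen, pad_value=0):
    """Pad (or truncate) each token sequence to maxlen and stack them into one array."""
    padded = np.full((len(sequences), maxlen), pad_value, dtype=np.int64)
    for row, seq in enumerate(sequences):
        trimmed = seq[:maxlen]
        padded[row, :len(trimmed)] = trimmed
    return padded

# Hypothetical sample: one collection made of three tokenized texts.
collection = [[4, 8, 15], [16, 23], [42, 7, 9, 1, 3, 5]]
batch = pad_collection(collection, maxlen=5)
print(batch.shape)  # (3, 5)
print(batch)
```

Each row of the result has the same length, so every row can be fed to its own input branch of the model.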


I'm using deeplearning4j, but when I load a pre-trained model for text classification I don't have enough RAM on my PC.

I tried changing the eclipse.ini file to add more memory via Xms and Xmx. Unfortunately, it doesn't work for me.

The linked documentation seems to offer a possible solution that uses less RAM; it costs more time, of course, but I don't care about that right now.

From that link:

Memory-mapped files: ND4J supports the use of a memory-mapped file instead of RAM when using the nd4j-native backend. On one hand, it’s slower than RAM, but on the other hand, it allows you to allocate memory chunks in a manner impossible otherwise.

Can I add this to code like the example in the link?

Of course, if there is another way (or a better way), please describe it. I'll appreciate any advice.

Thanks in advance.


I'm from the deeplearning4j project. Memory-mapped workspaces are indeed made for embeddings, and should be considered a separate concept from our off-heap memory. The off-heap memory is a conceptual rabbit hole I won't cover here (you need an understanding of the JVM, and the topic isn't relevant here).

The way you would have to use memory mapped workspaces is by loading the word2vec inside a memory mapped scope. The first component is the configuration:

import org.nd4j.linalg.api.memory.MemoryWorkspace;
import org.nd4j.linalg.api.memory.conf.WorkspaceConfiguration;
import org.nd4j.linalg.api.memory.enums.LocationPolicy;
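WorkspaceConfiguration mmap = WorkspaceConfiguration.builder()
        // builder calls missing from the source; assumed from the imports above
        .initialSize(initialSize)  // placeholder: size of the memory-mapped region in bytes
        .policyLocation(LocationPolicy.MMAP)
        .build();

try (MemoryWorkspace ws =
           Nd4j.getWorkspaceManager().getAndActivateWorkspace(mmap)) {
    // load your word2vec here
}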


A note on how memory-mapped workspaces should be used: memory mapping is intended only for accessing a large array and pulling subsets of it out. You should use it only to pull out the subset of word vectors you need for training.

When using word2vec (or any other embedding technique), the typical pattern is to look up only the word vectors you want and merge them together into a mini-batch. That mini-batch (and the associated training) should live in a separate workspace (or be left unattached, which is the default). The reason you can leave it unattached is that we already do workspaces and the other associated optimizations for you inside ComputationGraph and MultiLayerNetwork. Just make sure to pass in whatever you need to fit.

From there, use the INDArray get(..) and put(..) methods to copy the rows you need into another array that you then use for training.

For more information, look at leverage, leverageTo, detach, etc. in the INDArray javadoc.


I have a neural network that classifies the sentiment of reviews. The accuracy is not 100%, so there are texts that the network classifies incorrectly. How can I see them? I tried my own function, but it gives an error.

data = pd.concat([positive_train_data, negative_train_data, positive_test_data, negative_test_data], ignore_index=True)
x = data.Text
y = data.Sentiment

x_train, x_test, y_train1, y_test = train_test_split(x, y, test_size=0.50, random_state=2000)
print("Train set has total {0} entries with {1:.2f}% negative, {2:.2f}% positive".format(
    len(x_train),
    (len(x_train[y_train1 == 0]) / (len(x_train) * 1.)) * 100,
    (len(x_train[y_train1 == 1]) / (len(x_train) * 1.)) * 100))

print("Test set has total {0} entries with {1:.2f}% negative, {2:.2f}% positive".format(
    len(x_test),
    (len(x_test[y_test == 0]) / (len(x_test) * 1.)) * 100,
    (len(x_test[y_test == 1]) / (len(x_test) * 1.)) * 100))

tvec1 = TfidfVectorizer(max_features=10000, ngram_range=(1, 2), min_df=3, use_idf=1, smooth_idf=1, sublinear_tf=1, stop_words='english')
tvec1.fit(x_train)  # assumed: the fit call is missing from the source; transform() needs a fitted vectorizer
x_train_tfidf = tvec1.transform(x_train)
x_test_tfidf = tvec1.transform(x_test).toarray()

model = Sequential()
model.add(Dense(100, activation='relu', input_dim=10000))
model.add(Dense(50, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
optimiz = optimizers.Adam(lr=0.0001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)
model.compile(loss='binary_crossentropy', optimizer=optimiz, metrics=['accuracy'])
# first argument of fit() was lost in the source; x_train_tfidf is assumed
hist = model.fit(x_train_tfidf, y_train1, validation_data=(x_test_tfidf, y_test), epochs=5, batch_size=64)

And my function

y_pred_vect = model.predict(x_test_tfidf)
# boolean mask
mask = (y_pred_vect != y_test).any(axis=1)
num_words=5000 # only use the top 5000 words
INDEX_FROM=3   # word index offset
# this step is needed to get `test_x` back in its original form (before tokenization):
(train_x, _), (test_x, _) = imdb.load_data(num_words=num_words, index_from=INDEX_FROM)
x_wrong = test_x[mask]

word_to_id = imdb.get_word_index()
word_to_id = {k:(v+INDEX_FROM) for k,v in word_to_id.items()}
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2

id_to_word = {value:key for key,value in word_to_id.items()}
all_wrong_sents = [' '.join(id_to_word[id] for id in sent) for sent in x_wrong]

Error on line - mask = (y_pred_vect != y_test).any(axis=1)

Data must be 1-dimensional


Try this...

import numpy as np

mask = np.squeeze(y_pred_vect) != y_test
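One caveat worth checking: a sigmoid output layer returns probabilities, so comparing them directly with 0/1 labels would flag almost every sample as wrong. The sketch below thresholds the predictions at 0.5 first (an assumed cut-off) and uses small made-up arrays in place of the real data. If y_test and x_test are pandas Series, as in the question, converting them with .values before masking avoids index alignment issues.

```python
import numpy as np

# Hypothetical stand-ins for the arrays in the question.
y_pred_vect = np.array([[0.91], [0.12], [0.55], [0.40]])  # shape (4, 1), sigmoid outputs
y_test = np.array([1, 0, 0, 1])                           # shape (4,)
x_test = np.array(["great film", "awful plot", "not bad", "loved it"])

# Turn probabilities into 0/1 class predictions, then flatten to 1-D.
y_pred_labels = (np.squeeze(y_pred_vect, axis=1) > 0.5).astype(int)

mask = y_pred_labels != y_test
x_wrong = x_test[mask]
print(x_wrong)
```

Since x_test in the question already holds the raw review text, indexing it with the mask may be enough, without reloading and decoding the IMDB dataset.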


I have gathered over 20,000 legal pleadings in PDF format. I am an attorney, but I also write computer programs in MFC/VC++ to help with my practice. I'd like to learn to use neural networks (unfortunately my math skills are limited to college algebra) to classify documents filed in lawsuits.

My first goal is to train a three-layer feed-forward neural network to recognize whether a document is a small claims document (with the letters "SP" in the case number), or whether it is a regular document (with the letters "CC" in the case number). Every attorney puts some variant of the word "Case:" or "Case No" or "Case Number" or one of countless variations of that. So I've taken the first 600 characters (all attorneys will put the case number within the first 600 chars), and made a CSV database with each row being one document, with 600 columns containing the ASCII codes of the first 600 characters, and a 601st column that is either a "1" for regular cases or a "0" for small claims.

I then run it through a neural network program (naturally, I updated the program to handle 600 input neurons with one output), but when I run it the accuracy is horrible: something like 2% on the training data and 0% on the general set. About 1/8 of the documents are for non-small-claims cases.

Is this the sort of problem a Neural Net can handle? What am I doing wrong?


So I've taken the first 600 characters (all attorneys will put the case number within the first 600 chars), and made a CSV database with each row being one document, with 600 columns containing the ASCII codes of the first 600 characters, and the 601st character is either a "1" for regular cases, or a "0" for small claims.

Looking at each and every character at the beginning of the document independently will be very inaccurate. Rather than consider the characters independently, first tokenize the first 600 characters into words. Use those words as input to your neural net, rather than individual characters.

Note that once you have tokenized the first 600 characters, you may well find a distinctly finite list of tokens that mean "case number", removing the need for a neural net.
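As a rough sketch of that rule-based alternative, the Python snippet below searches the first 600 characters for a case-number label followed by a number containing "SP" or "CC". The regular expression, the function name classify_pleading, and the sample case numbers are assumptions for illustration; real pleadings will likely need more variants.

```python
import re

# Hypothetical pattern: "Case", an optional "No"/"Number"/"#", then a case number
# containing either "SP" (small claims) or "CC" (regular).
CASE_RE = re.compile(
    r"case\s*(?:no\.?|number|#)?\s*[:.]?\s*([0-9A-Z-]*?(SP|CC)[0-9A-Z-]*)",
    re.IGNORECASE,
)

def classify_pleading(text, window=600):
    """Return 'SP', 'CC', or None based on the first `window` characters."""
    match = CASE_RE.search(text[:window])
    return match.group(2).upper() if match else None

print(classify_pleading("IN THE COUNTY COURT\nCase No.: 2016-SP-001234\nPlaintiff v. Defendant"))  # SP
print(classify_pleading("Case Number 16-CC-000987"))  # CC
print(classify_pleading("No case identifier here"))   # None
```

Documents for which the function returns None are the ones worth inspecting by hand to discover further label variants.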

The Stanford Natural Language Processor provides this functionality. You can find a .NET-compatible implementation on NuGet.