Hot questions for Using Neural networks in gensim

Question:

After training a word2vec model using python gensim, how do you find the number of words in the model's vocabulary?


Answer:

The vocabulary is in the vocab field of the Word2Vec model's wv property, as a dictionary whose keys are the tokens (words). So it's just the usual Python way of getting a dictionary's length:

len(w2v_model.wv.vocab)

(In older gensim versions before 0.13, vocab appeared directly on the model, so you would use w2v_model.vocab instead of w2v_model.wv.vocab. Conversely, in gensim 4.0 and later the vocab dictionary was removed; use len(w2v_model.wv) or len(w2v_model.wv.key_to_index) instead.)

Question:

There is a Convolution1D example at https://github.com/fchollet/keras/blob/master/examples/imdb_cnn.py, but it does not use word2vec.

Currently, I am using gensim to train a word2vec model.

I want to use word2vec and a Keras CNN (2D, not 1D) to do document classification on Chinese text. I have learned the basic flow of text classification with a CNN and want to do a test.

For example (the steps I imagine):
  1. Use a good Chinese tokenized text set to train a word2vec model

    model = gensim.models.Word2Vec(new_sentences, workers=10, size=200, min_count=2)
    
  2. Tokenize my sentence dataset into lists of words (the longest sentence has over 8,000 words, the shortest fewer than 50)

    1     ['你们', '好', '今天', '天气', '真', '好']
    2     ['嗯', '对的']
    ...
    9999  ['好', '就', '这样']
    
  3. Use a method to transform the word-list dataset into a word2vec dataset

    Transform every word in every sentence into a vector using the trained model.

    1     [[word2vec size=200], [word2vec size=200], [word2vec size=200], [word2vec size=200], [word2vec size=200], [word2vec size=200]]
    2     [[word2vec size=200], [word2vec size=200]]
    ...
    9999  [[word2vec size=200], [word2vec size=200], [word2vec size=200]]
    
  4. Pad the word2vec dataset (with size=200 zero arrays)

    1     [[word2vec size=200], [word2vec size=200], [word2vec size=200], [word2vec size=200], [word2vec size=200], [word2vec size=200]]
    2     [[word2vec size=200], [word2vec size=200], [word2vec size=200], [word2vec size=200], [word2vec size=200], [word2vec size=200]]
    ....
    9999  [[word2vec size=200], [word2vec size=200], [word2vec size=200], [word2vec size=200], [word2vec size=200], [word2vec size=200]]
    
  5. Feed the result to the CNN (using Convolution2D)


I have searched for a long time but can't find any way to do step 3 (and after step 3, the parameter and layer settings in step 5 are hard to understand too).


Answer:

Transforming a single sentence to a 2D array

Assuming you have a list of words (sentence) and a trained model, you can do:

import numpy as np

# model[word] is deprecated in recent gensim; look vectors up on model.wv,
# and skip words missing from the vocabulary (e.g. dropped by min_count).
word_vecs = [model.wv[word] for word in sentence if word in model.wv]
sentence_vec = np.stack(word_vecs) if word_vecs else None  # shape: (n_words, 200)
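For step 4, each resulting (n_words, 200) matrix then needs to be padded to a common length with zero rows. A minimal NumPy sketch (the max_len value and the toy shapes below are assumptions for illustration):

```python
import numpy as np

def pad_sentence(sentence_vec, max_len, dim=200):
    """Pad an (n_words, dim) matrix with zero rows to shape (max_len, dim).

    Sentences longer than max_len are truncated.
    """
    padded = np.zeros((max_len, dim), dtype=np.float32)
    n = min(len(sentence_vec), max_len)
    padded[:n] = sentence_vec[:n]
    return padded

# Toy example: a "sentence" of 3 word vectors padded to length 6.
sent = np.ones((3, 200), dtype=np.float32)
out = pad_sentence(sent, max_len=6)
print(out.shape)  # (6, 200)
```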

As for step 5, it would help if you listed what you are having trouble with. Basically, all you need to do is change the 1D operations (Convolution1D, GlobalMaxPooling1D) to their 2D counterparts.
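To make the 2D setup concrete: after padding, each sentence is a (max_len, 200) matrix that a Convolution2D layer treats like a one-channel image. A naive NumPy sketch of a "valid" 2D cross-correlation, with the kernel assumed to span the full embedding width (a common choice for text CNNs, so each output value summarizes a window of consecutive words):

```python
import numpy as np

def conv2d_valid(x, kernel):
    """Naive 'valid' 2D cross-correlation of x (H, W) with kernel (kh, kw)."""
    kh, kw = kernel.shape
    H, W = x.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

# A padded sentence matrix (6 words x 200 dims) convolved with a 3x200 kernel
# yields a (4, 1) feature map: one value per window of 3 consecutive words.
sentence = np.random.rand(6, 200)
kernel = np.random.rand(3, 200)
print(conv2d_valid(sentence, kernel).shape)  # (4, 1)
```

A framework layer adds multiple such kernels (filters), a bias, and a nonlinearity, but the sliding-window arithmetic above is the core of what the 2D convolution computes on the sentence matrix.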