Hot questions for Using Neural networks in vowpalwabbit


In sklearn, when we pass sentences to algorithms, we can use text feature extractors such as CountVectorizer, TfidfVectorizer, etc., and we get an array of floats.

But what do we get when we pass vowpal wabbit an input file like this one:

-1 |Words The sun is blue
1 |Words The sun is yellow

What does vowpal wabbit use internally? How is this text transformed?


There are two separate questions here:

Q1: Why can't you (and shouldn't you) use transformations like tf-idf when using vowpal wabbit?

A1: vowpal wabbit is not a batch learning system; it is an online learning system. To compute measures like tf-idf (term frequency in each document vs. the whole corpus) you need to see all the data (the corpus) first, and sometimes make multiple passes over it. As an online/incremental learning system, vowpal wabbit is designed to also work on problems where you don't have the full data ahead of time. See this answer for a lot more details.
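
To make the corpus dependence concrete, here is a minimal sketch in plain Python. It uses one common smoothed idf variant (an assumption; libraries differ in the exact formula), and the function names are illustrative only. It shows that the weight of a word in an already-seen document changes when a new document arrives, which is why tf-idf does not fit a strictly one-example-at-a-time setting.

```python
import math
from collections import Counter

def idf(term, corpus):
    """Smoothed inverse document frequency of `term` over a list of token lists."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log((1 + len(corpus)) / (1 + containing)) + 1

def tfidf(doc, corpus):
    """tf-idf weights for one tokenized document, given the corpus seen so far."""
    counts = Counter(doc)
    return {term: (n / len(doc)) * idf(term, corpus) for term, n in counts.items()}

corpus = [["the", "sun", "is", "blue"]]
before = tfidf(corpus[0], corpus)

# A second document arrives. The weight of "blue" in the FIRST document
# changes, because its idf depends on every document seen so far.
corpus.append(["the", "sun", "is", "yellow"])
after = tfidf(corpus[0], corpus)

print(before["blue"], after["blue"])
```

Words shared by both documents ("the", "sun", "is") keep their weight here, while the now-distinctive "blue" gains weight.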

Q2: How does vowpal wabbit "transform" the features it sees?

A2: It doesn't. It simply maps each word feature on the fly to its hashed location in memory. Learning is then driven by an iterative optimization loop (SGD or BFGS) that processes the data example by example to minimize the modeling error. You may select the loss function to optimize.
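
The idea in A2 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not vw's implementation: zlib.crc32 stands in for vw's real hash function, the "namespace^word" key and the 18-bit table size are choices made for the example, and the update is a plain logistic-loss SGD step without any of vw's refinements.

```python
import math
import zlib

NUM_BITS = 18                      # assumed table size of 2**18 slots
weights = [0.0] * (1 << NUM_BITS)  # one flat weight array, no vocabulary

def slot(namespace, word):
    """Map a feature name to an index in the weight array (stand-in hash)."""
    key = f"{namespace}^{word}".encode("utf-8")  # hypothetical key layout
    return zlib.crc32(key) & ((1 << NUM_BITS) - 1)

def parse(line):
    """Split a line such as '-1 |Words The sun is blue' into (label, slots)."""
    head, body = line.split("|", 1)
    namespace, *words = body.split()
    return float(head), [slot(namespace, w) for w in words]

def learn(line, rate=0.5):
    """One online step: predict first, then take a logistic-loss gradient step."""
    label, slots = parse(line)
    score = sum(weights[i] for i in slots)
    # Derivative of log(1 + exp(-label * score)) with respect to score.
    grad = -label / (1.0 + math.exp(label * score))
    for i in slots:
        weights[i] -= rate * grad
    return score  # the prediction made BEFORE the update

for _ in range(20):
    learn("-1 |Words The sun is blue")
    learn("1 |Words The sun is yellow")
```

No vocabulary is ever built: a word that has never been seen simply lands on some slot of the array, which is what lets this kind of learner run over a stream.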

However, if you already have the full data you want to train on, nothing prevents you from transforming it (using any other tool) before feeding the transformed values to vowpal wabbit. It's your choice. Depending on the particular data, a transformation pre-pass may give better or worse results than running multiple passes with vowpal wabbit itself without preliminary transformations (check out the vw --passes option).

To complete the answer, let's add another related question:

Q3: Can I use pre-transformed (e.g. tf-idf) data with vowpal wabbit?

A3: Yes, you can. Just use the following (post-transformation) form. Instead of words, use integers as feature IDs, and since any feature can have an optional explicit weight, put the tf-idf floating-point value after the : separator as that weight, as in the typical SVMlight format:

-1 |  1:0.534  15:0.123  3:0.27  29:0.066  ...
1  |  3:0.1  102:0.004  24:0.0304  ...

This works because vw distinguishes between string and integer feature names. It doesn't hash feature names that look like integers (unless you explicitly use the --hash all option). Integer feature numbers are used directly, as if they were the hash result of the feature.
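
If the tf-idf values come out of another tool as a mapping from integer feature ID to weight, writing lines in the form shown above takes only a small helper. The sketch below is illustrative: to_vw_line is a hypothetical name, and the choice to sort IDs and skip zero weights is an assumption made for readability and sparsity, not something vw requires.

```python
def to_vw_line(label, weights_by_id):
    """Format one example as '<label> | <id>:<weight> ...'.

    `weights_by_id` maps integer feature IDs to tf-idf weights; zero weights
    are skipped because the format is sparse.
    """
    feats = " ".join(
        f"{fid}:{w:.6g}" for fid, w in sorted(weights_by_id.items()) if w != 0
    )
    return f"{label} | {feats}"

print(to_vw_line(-1, {1: 0.534, 15: 0.123, 3: 0.27, 29: 0.066}))
# prints: -1 | 1:0.534 3:0.27 15:0.123 29:0.066
```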


I would like to train a binary sigmoidal feedforward network for category classification with the following command, using the awesome vowpal wabbit tool:

vw --binary --nn 4 train.vw -f category.model

And test it:

vw --binary -t -i category.model -p test.vw

But I got very bad results (compared to my linear SVM estimator).

I found a comment saying that I should use the number-of-training-passes argument (--passes arg).

So my question is: how do I choose the number of training passes so that the model does not end up overtrained?

P.S. Should I use the holdout_period argument, and if so, how?


The test command in the question is incorrect. It has no input file (-p ... names the file the predictions are written to). It is also unclear whether you want to test or predict: the text says test, but the command uses -p ...

Test means you have labeled data and you're evaluating the quality of your model. Strictly speaking, predict means you don't have labels, so you can't actually know how good your predictions are. In practice, you may also predict on held-out, labeled data (pretending it has no labels by ignoring them) and then evaluate how good these predictions are, since you actually have the labels.


  • If you want to do binary classification, you should use labels in {-1, 1} and use --loss_function logistic. --binary is an independent option meaning you want the predictions to be binary (giving you less info).

  • If you already have a separate test set with labels, you don't need to hold out.
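
A short sketch of why binary predictions carry "less info": assuming the model produces a raw real-valued score and that, under logistic loss, a logistic function turns that score into a probability, thresholding keeps only the sign. The function names and the two raw scores below are hypothetical.

```python
import math

def to_probability(raw):
    """Logistic function: squash a raw score into P(label = +1)."""
    return 1.0 / (1.0 + math.exp(-raw))

def to_binary(raw):
    """Thresholded view of the same score: only the sign survives."""
    return 1 if raw > 0 else -1

for raw in (0.1, 4.0):  # hypothetical raw scores
    print(raw, to_binary(raw), round(to_probability(raw), 3))
```

Both scores map to the same binary prediction, although the first is barely better than a coin flip and the second is a confident call.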

The holdout mechanism in vw was designed to replace the test set and avoid overfitting. It is only relevant when multiple passes are used, because in a single pass all examples are effectively held out: each next (yet unseen) example is treated 1) as unlabeled for prediction, and 2) as labeled for testing and model update. In other words, your train set is effectively also your test set.

So you can either do multiple passes on the train-set with no holdout:

 vw --loss_function logistic --nn 4 -c --passes 2 --holdout_off train.vw -f model

and then test the model with a separate, labeled test set:

 vw -t -i model test.vw

or do multiple passes on the same train set with some hold-out as a test set:

vw --loss_function logistic --nn 4 -c --passes 20 --holdout_period 7 train.vw -f model

If you don't have a test set and you want to fit more strongly by using multiple passes, you can ask vw to hold out every Nth example (the default N is 10, but you may override it explicitly using --holdout_period <N> as seen above). In this case, you can specify a higher number of passes because vw will automatically terminate early when the loss on the held-out set starts growing.

You'd notice you hit early termination since vw will print something like:

passes used = 5
average loss = 0.06074 h

This indicates that only 5 out of N passes were actually used before early stopping, and that the error on the held-out subset of examples is 0.06074 (the trailing h indicates this is held-out loss).
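
The mechanism can be sketched in Python as follows. This is a simplified illustration, not vw's code: split_holdout and run_passes are hypothetical names, the stopping rule here fires the first time the held-out loss fails to improve (vw's actual rule may be more tolerant), and the sequence of losses is scripted purely to exercise the loop.

```python
def split_holdout(examples, period=10):
    """Hold out every `period`-th example; train on the rest."""
    train = [ex for i, ex in enumerate(examples, 1) if i % period != 0]
    held = [ex for i, ex in enumerate(examples, 1) if i % period == 0]
    return train, held

def run_passes(train, held, one_pass, loss, max_passes):
    """Repeat passes until the held-out loss stops improving.

    `one_pass(train)` updates the model in place; `loss(held)` scores it.
    Returns (passes_used, best_held_out_loss).
    """
    best = float("inf")
    for n in range(1, max_passes + 1):
        one_pass(train)
        current = loss(held)
        if current >= best:  # held-out loss grew: stop early
            return n, best
        best = current
    return max_passes, best

train, held = split_holdout(list(range(1, 21)), period=7)

# Scripted held-out losses standing in for a real model (illustration only).
scripted = iter([0.30, 0.12, 0.08, 0.06, 0.07, 0.09])
passes_used, best_loss = run_passes(
    train, held,
    one_pass=lambda data: None,
    loss=lambda data: next(scripted),
    max_passes=20,
)
print(passes_used, best_loss)
```

Asking for 20 passes is harmless in this setup: the loop returns as soon as the held-out loss turns upward, so the requested number of passes acts as an upper bound.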

As you can see, the number of passes and the holdout period are completely independent options.

To improve your model and gain more confidence in it, you could use other optimizations, vary the holdout_period, or try other --nn args. You may also want to check the vw-hypersearch utility (in the utl subdirectory) to help find better hyper-parameters.

Here's an example of using vw-hypersearch on one of the test-sets included with the source:

$ vw-hypersearch 1 20 vw --loss_function logistic --nn % -c --passes 20 --holdout_period 11 test/train-sets/rcv1_small.dat --binary
trying 13 ............. 0.133333 (best)
trying 8 ............. 0.122222 (best)
trying 5 ............. 0.088889 (best)
trying 3 ............. 0.111111
trying 6 ............. 0.1
trying 4 ............. 0.088889 (best)
loss(4) == loss(5): 0.088889
5       0.08888

This indicates that either 4 or 5 should be a good value for --nn, yielding a loss of 0.08888 on a held-out subset of 1 in 11 examples.


I know that the syntax for training a single-layer neural network is:

vw -d data.vw --nn 10

(Thanks, FastML)

What if I'd like to add a second layer, say with 5 nodes? Is that possible?


For those of you interested in using VW neural nets in an applied setting I've posted a public Google drive linked to a worked example relating to this question here:


VW doesn't have the ability to construct more than 1 hidden layer (using --nn X). You'd have to use a different non-linear algorithm or a different framework.