Hot questions for using neural networks in feature extraction

Question:

I've seen some tutorial examples, like the UFLDL convolutional net, where the features are obtained by unsupervised learning, and others where the kernels are engineered by hand (Sobel and Gabor detectors, different sharpness/blur settings, etc.). Strangely, I can't find a general guideline on how to choose a good kernel for anything more than a toy network. For example, in a deep network with many convolution-pooling layers, are the same kernels used at each layer, or does each layer have its own kernel subset? If so, where do the deeper layers' filters come from: should I learn them using some unsupervised learning algorithm on data passed through the first convolution-and-pooling layer pair?

I understand that this question doesn't have a single answer; I'd be happy with just the general approach (a review article would be fantastic).


Answer:

The current state of the art is to learn all the convolutional layers from the data using backpropagation (ref).

Also, this paper recommends small kernels (3x3) and small pooling windows (2x2). You should train different filters for each layer.
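
A minimal sketch of that recipe in Keras (the filter counts and input shape are arbitrary illustration choices, not from the paper):

from tensorflow.keras import layers, models

# Stacked 3x3 convolutions with 2x2 max pooling; every Conv2D layer
# has its own set of filters, and all of them are learned jointly
# by backpropagation.
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),  # deeper layer, separate filters
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy')
# model.fit(x_train, y_train) would then learn all the kernels from data.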

Question:

My question is: can we use a CNN for feature extraction, and then use the extracted features as input to another classification algorithm such as an SVM?

Thanks


Answer:

Yes, this has already been done and is well documented in several research papers, such as CNN Features off-the-shelf: an Astounding Baseline for Recognition and How transferable are features in deep neural networks?. Both show that features from a CNN trained on one dataset, when applied to a different dataset, usually perform very well or even beat the state of the art.

In general, you can take the features from the layer before the last, normalize them, and use them with another classifier, as in the sketch below.
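
A minimal sketch of that pipeline, assuming Keras and scikit-learn, with x_train (images) and y_train (labels) presumed to exist:

from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input
from sklearn.preprocessing import normalize
from sklearn.svm import LinearSVC

# Pretrained CNN without its classification head; pooling='avg' yields
# one fixed-length feature vector per image.
base = VGG16(weights='imagenet', include_top=False, pooling='avg')

features = base.predict(preprocess_input(x_train.astype('float32')))
features = normalize(features)  # L2-normalize before the classifier

clf = LinearSVC()
clf.fit(features, y_train)      # train an SVM on the CNN features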

Another related technique is fine-tuning: after training a network, the last layer is replaced and retrained while the previous layers' weights are kept fixed.
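
And a corresponding sketch of that variant, again assuming Keras (the 5-class head is a hypothetical example):

from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

base = VGG16(weights='imagenet', include_top=False, pooling='avg')
base.trainable = False          # keep the pretrained weights fixed

# Replace the original last layer with a fresh head for the new task.
model = models.Sequential([
    base,
    layers.Dense(5, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
# model.fit(...) now updates only the new Dense head.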

Question:

In the book Deep Learning with Python by François Chollet (creator of Keras), section 5.3 (see the companion Jupyter notebook), the following is unclear to me:

Let's put this in practice by using the convolutional base of the VGG16 network, trained on ImageNet, to extract interesting features from our cat and dog images, and then training a cat vs. dog classifier on top of these features.

[...]

There are two ways we could proceed:

  • Running the convolutional base over our dataset, recording its output to a Numpy array on disk, then using this data as input to a standalone densely-connected classifier similar to those you have seen in the first chapters of this book. This solution is very fast and cheap to run, because it only requires running the convolutional base once for every input image, and the convolutional base is by far the most expensive part of the pipeline. However, for the exact same reason, this technique would not allow us to leverage data augmentation at all.
  • Extending the model we have (conv_base) by adding Dense layers on top, and running the whole thing end-to-end on the input data. This allows us to use data augmentation, because every input image is going through the convolutional base every time it is seen by the model. However, for this same reason, this technique is far more expensive than the first one.

Why can't we augment our data (generate more images from the existing data), run the convolutional base over the augmented dataset (one time), record its output and then use this data as input to a standalone fully-connected classifier?

Wouldn't it give similar results to the second alternative but be faster?

What am I missing?


Answer:

Wouldn't it give similar results to the second alternative but be faster?

Similar results yes, but would it really be faster?

The main point of Chollet here is that the second way is more expensive simply because of the larger number of images produced by the augmentation procedure itself; while the first approach

only requires running the convolutional base once for every input image

in the second

every input image is going through the convolutional base every time it is seen by the model [...] for this same reason, this technique is far more expensive than the first one

since

the convolutional base is by far the most expensive part of the pipeline

where "every time it is seen by the model" must be understood as "in every version produced by the augmentation procedure" (agree, the wording could and should be clearer here...).

There is no workaround for this with your proposed method. It's a valid alternative version of the second way, sure, but there is no reason to believe it will actually be faster, taking into account the whole end-to-end process (CNN+FC) in both cases...

UPDATE (after comment):

Maybe you are right, but I still have a feeling of missing something since the author explicitly wrote that the first method "would not allow us to leverage data augmentation at all".

I think you are just over-reading things here - although, again, the author arguably could and should be clearer; as written, Chollet's argument here is somewhat circular (it can happen to the best of us): since we run "the convolutional base [only] once for every input image", it turns out by definition that we don't use any augmentation... Interestingly enough, the phrase in the book (p. 146) is slightly different (less dramatic):

But for the same reason, this technique won’t allow you to use data augmentation.

And what is that reason? That we feed each image to the convolutional base only once, of course...

In other words, it's not in fact that we are not "allowed" to, but rather that we have chosen not to augment (in order to be faster, that is)...
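
To make the trade-off concrete, here is a rough sketch of the proposed variant (assuming conv_base, x_train, and y_train exist as in Chollet's section 5.3; the augmentation settings and n_rounds are arbitrary):

import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(rotation_range=40, horizontal_flip=True)
n_rounds = 5   # number of augmented versions per image (arbitrary)

gen = augmenter.flow(x_train, y_train, batch_size=len(x_train), shuffle=False)
features, labels = [], []
for _ in range(n_rounds):
    x_aug, y_aug = next(gen)                  # one augmented copy of the set
    features.append(conv_base.predict(x_aug))
    labels.append(y_aug)
features = np.concatenate(features)
labels = np.concatenate(labels)
# The cached features can now train a standalone dense classifier, but
# note that the conv base still ran n_rounds times per original image,
# which is exactly the cost the second approach pays.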

Question:

I am attempting to pull the y values out of a neural network. The current problem seems to be that numpy is not multiplying the matrices as I expected. I have included the code and output for your review. Thank you in advance for any insights you can provide.

import numpy as np

def columnToRow(column):
    # wrap a 1-D array in an extra dimension, producing a (1, n) row vector
    newarray = np.array([column])
    return newarray


def calcIndividualOutput(indivInputs,weights,biases):
  # finds the resulting y values for one set of input data
  I_transposed= columnToRow(indivInputs)
  output = np.multiply(I_transposed, weights) + biases
  return output


def getOutputs(inputs,weights,biases):
  # iterates over each set of inputs to find corresponding outputs 
  # returns output matrix
  i_len = len(inputs)-1
  outputs = []
  for i in range(0,i_len):
    result = calcIndividualOutput(inputs[i],weights,biases)
    outputs.append(np.tanh(result))
    if (i==i_len):
      print("Final Input reached:", i)
  return outputs



# Test Single line of Outputs should
#print("Resulting Outputs0:\n\n",resultingOutputs[0,0:])

# Testing 
currI=data[0]
Itrans=columnToRow(currI)
print(" THE CURRENT I0\n\n",currI,"\n\n")
print("transposed I:\n\n",Itrans,"\n\n")
print("Itrans shape:\n\n",Itrans.shape,"\n\n")

print("Current biases:\n\n",model_l1_b,"\n\n")
print("Current biases shape:\n\n",model_l1_b.shape,"\n\n")
print("B trans:",b_trans,"\n\n")
print("B trans shape:",b_trans.shape,"\n\n")

print("Current weights:\n\n",model_l1_W,"\n\n")
print("Transposed weights\n\n",w_transposed,"\n\n")
print("wtrans shape:\n\n",w_transposed.shape,"\n\n")



#Test calcIndividualOutput

testOutput= calcIndividualOutput(currI,w_transposed,b_trans)
print("Test calcIndividualOutput:\n\n",testOutput,"\n\n")
print("Test calcIndividualOutput Shape:\n\n",testOutput.shape,"\n\n")

# Transpose weights to match dimensions of input
b_trans=columnToRow(model_l1_b)
w_transposed=np.transpose(model_l1_W)
resultingOutputs = getOutputs(data,w_transposed,b_trans)

Output:

THE CURRENT I0

 [-0.66399151 -0.59143853  0.5230611  -0.52583802 -0.31089544  0.47396523
 -0.7301591  -0.21042131  0.92044264 -0.48792791 -1.54127669] 


transposed I:

 [[-0.66399151 -0.59143853  0.5230611  -0.52583802 -0.31089544  0.47396523
  -0.7301591  -0.21042131  0.92044264 -0.48792791 -1.54127669]] 


Itrans shape:

 (1, 11) 


Current biases:

 [ 0.04497563 -0.01878226  0.03285328  0.00443657 -0.10408497  0.03982726
 -0.07724283] 


Current biases shape:

 (7,) 


B trans: [[ 0.04497563 -0.01878226  0.03285328  0.00443657 -0.10408497  0.03982726
  -0.07724283]] 


B trans shape: (1, 7) 


Current weights:

 [[ 0.02534341  0.01163373 -0.20102289  0.23845847  0.20859972 -0.09515963
   0.00744185 -0.06694793 -0.03806938  0.02241485  0.34134269]
 [ 0.0828636  -0.14711063  0.44623381  0.0095899   0.41908434 -0.25378567
   0.35789928  0.21531652 -0.05924326 -0.18556432  0.23026766]
 [-0.23547475 -0.18090464 -0.15210266  0.10483326 -0.0182989   0.52936584
   0.15671678 -0.64570689 -0.27296376  0.28720504  0.21922119]
 [-0.17561196 -0.42502806 -0.34866759 -0.07662395 -0.02361901 -0.10330012
  -0.2626377   0.19807351  0.20543958 -0.34499851  0.29347673]
 [-0.04404973 -0.31600055 -0.22984107  0.21733086 -0.15065287  0.18301299
   0.13399698  0.11884601  0.04380761 -0.03720044  0.0146924 ]
 [ 0.25086868  0.15678053  0.30350113  0.13065964 -0.30319506  0.47015968
   0.00549904  0.32486886 -0.00331726  0.22858304  0.16789439]
 [-0.10196115 -0.03687141 -0.28674102  0.01066647  0.2475083   0.15808311
  -0.1452509   0.09170815 -0.14578934 -0.07375327 -0.16524883]] 


Transposed weights

 [[ 0.02534341  0.0828636  -0.23547475 -0.17561196 -0.04404973  0.25086868
  -0.10196115]
 [ 0.01163373 -0.14711063 -0.18090464 -0.42502806 -0.31600055  0.15678053
  -0.03687141]
 [-0.20102289  0.44623381 -0.15210266 -0.34866759 -0.22984107  0.30350113
  -0.28674102]
 [ 0.23845847  0.0095899   0.10483326 -0.07662395  0.21733086  0.13065964
   0.01066647]
 [ 0.20859972  0.41908434 -0.0182989  -0.02361901 -0.15065287 -0.30319506
   0.2475083 ]
 [-0.09515963 -0.25378567  0.52936584 -0.10330012  0.18301299  0.47015968
   0.15808311]
 [ 0.00744185  0.35789928  0.15671678 -0.2626377   0.13399698  0.00549904
  -0.1452509 ]
 [-0.06694793  0.21531652 -0.64570689  0.19807351  0.11884601  0.32486886
   0.09170815]
 [-0.03806938 -0.05924326 -0.27296376  0.20543958  0.04380761 -0.00331726
  -0.14578934]
 [ 0.02241485 -0.18556432  0.28720504 -0.34499851 -0.03720044  0.22858304
  -0.07375327]
 [ 0.34134269  0.23026766  0.21922119  0.29347673  0.0146924   0.16789439
  -0.16524883]] 


wtrans shape:

 (11, 7) 


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-162-7e8be1d52690> in <module>()
     48 #Test calcIndividualOutput
     49 
---> 50 testOutput= calcIndividualOutput(currI,w_transposed,b_trans)
     51 print("Test calcIndividualOutput:\n\n",testOutput,"\n\n")
     52 print("Test calcIndividualOutput Shape:\n\n",testOutput.shape,"\n\n")

<ipython-input-162-7e8be1d52690> in calcIndividualOutput(indivInputs, weights, biases)
      7   # finds the resulting y values for one set of input data
      8   I_transposed= columnToRow(indivInputs)
----> 9   output = np.multiply(I_transposed, weights) + biases
     10   return output
     11 

ValueError: operands could not be broadcast together with shapes (1,11) (11,7) 

Answer:

np.multiply is for multiplying arrays element-wise, but from the dimensions of your data I guess that you are looking for matrix multiplication. To get that, use np.dot.
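
For reference, the corrected function would look like this (same code as in the question, with only the multiplication changed):

def calcIndividualOutput(indivInputs, weights, biases):
    # finds the resulting y values for one set of input data
    I_transposed = columnToRow(indivInputs)
    # matrix product: (1, 11) @ (11, 7) -> (1, 7); biases broadcast as (1, 7)
    output = np.dot(I_transposed, weights) + biases
    return output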

Question:

What kind of filter should I use to extract feature maps in convolutional NN?

I've been reading about convolutional NNs recently, and I understand that we use a set of filters to generate a set of feature maps in each convolution layer by convolving those filters over the outputs of the previous layer.

1) How do we get these filters?

2) Do we pick filters randomly and do some 'trial and error'?

3) How do we find the perfect filters for our project?

Thank you.


Answer:

1) You don't pick them directly; you let the network learn them by showing it examples (training data), as in the sketch after this list.

2) We initialise the filters randomly so that each one can learn something useful and different.

3) By providing lots of data relevant to your project.
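
A minimal sketch of points 1) and 2), assuming Keras (filter count and input shape are arbitrary):

from tensorflow.keras import layers, models

model = models.Sequential([
    # 16 filters of size 3x3, initialized randomly (glorot_uniform by default)
    layers.Conv2D(16, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
# model.fit(x_train, y_train) turns the random filters into useful
# edge/texture detectors via backpropagation.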

Question:

I have a dataset of tweets labelled with four emotions (anger, joy, fear, sadness). For instance, I transformed tweets into a vector similar to the following input vector for anger:

  • Mean of frequency distribution of anger tokens
  • word2vec similarity to anger
  • Mean of anger in emotion lexicon
  • Mean of anger in hashtag lexicon

Is that vector valid to train a neural network?


Answer:

Your input vector looks fine to start with. Of course, you might later make it more advanced with statistical and derived data from Twitter or other relevant APIs or datasets.

Your network would have four outputs, one per emotion, just like you mentioned:

Joy: [1,0,0,0]
Sadness: [0,1,0,0]
Fear: [0,0,1,0]
Anger: [0,0,0,1]

And you may consider adding multiple hidden layers, making it a deep network, if you wish to increase the capacity of your neural network prototype; see the sketch below.
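
A minimal sketch of such a network, assuming Keras (the hidden-layer widths are arbitrary; X is the (n, 4) feature matrix and Y the (n, 4) one-hot labels):

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Dense(16, activation='relu', input_shape=(4,)),  # feature vector in
    layers.Dense(16, activation='relu'),                    # optional extra hidden layer
    layers.Dense(4, activation='softmax'),                  # one output per emotion
])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(X, Y, epochs=50)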

As your question also shows, it is best to have a good preprocessing and feature-extraction system in place prior to training and testing, and you certainly seem to know where the project is going.

Great project, best wishes, thank you for your good question and welcome to stackoverflow.com!

See also: TensorFlow Playground (https://playground.tensorflow.org/).