Hot questions for using neural networks in feature selection


I need to do dimensionality reduction on a series of images. More specifically, each image is a snapshot of a moving ball, and the optimal features would be its position and velocity. As far as I know, CNNs are the state of the art for reducing features in image classification, but in that case only a single frame is provided. Is it possible to also extract time-dependent features given many images at different time steps? If not, what are the state-of-the-art techniques for doing so?

This is the first time I have used a CNN, and I would also appreciate any references or other suggestions.


If you want the network to recognize a time-dependent progression, you should probably look into recurrent neural nets (RNNs). Since you would be operating on video, you should look into recurrent convolutional neural nets (RCNNs) such as in:

Recurrence adds some memory of a previous state of the input data. See this good explanation by Karpathy:

In your case you need the recurrence to run across multiple images instead of within a single image. The first problem you need to solve seems to be image segmentation (being able to pick the ball out of the rest of the image), and the first paper linked above deals with segmentation. Then again, maybe you're trying to take advantage of the movement in order to identify the moving object?

Here's another thought: perhaps you could look only at the differences between sequential frames and use those as the input to your convnet. The input "image" would then show where the moving object was in the previous frame and where it is in the current one. Larger differences would indicate larger amounts of movement. That would probably have a similar effect to using a recurrent network.
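The frame-differencing idea can be sketched in a few lines. This is a minimal illustration on synthetic data, assuming the frames are stored as a NumPy array of shape (time, height, width); the function name `frame_differences` and the one-pixel "ball" are made up for the example.

```python
import numpy as np

def frame_differences(frames):
    """Return frame[t+1] - frame[t] for every pair of consecutive frames."""
    frames = np.asarray(frames, dtype=np.float32)
    return np.diff(frames, axis=0)

# Synthetic clip (assumption): a one-pixel "ball" moving one pixel to the right per frame.
T, H, W = 4, 8, 8
frames = np.zeros((T, H, W), dtype=np.float32)
for t in range(T):
    frames[t, 3, 2 + t] = 1.0

diffs = frame_differences(frames)
# In each difference image the old position is negative and the new position is positive,
# so a single input carries both where the ball was and where it is now.
```

The resulting array has one image fewer than the clip and could be fed to a convnet in place of (or stacked with) the raw frames.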


I have a dataset from Multiple Object Tracking trials, where a participant follows 8 points on a display, 4 of which are targets (marked briefly at the beginning of the trial) and 4 of which are distractors. At the end of the trial, the person marks the 4 targets. My dataset only includes trials where the participant's reply was correct. The data are sampled at 10 frames per second; each frame includes the positions of the points and the position of the eye gaze, so 18 numbers in total. A trial lasts 8 seconds. There are 40 possible trajectories for the points.

I'm trying to train a neural net to mark the 4 targets based solely on the positions of the gaze and the points. The problem is that, in the dataset, the answer is always the first 4 points in the vector. If I used these outputs for training, the net would just learn to always say [1,1,1,1,0,0,0,0]. Is there a way I can alter the input or the output (or both), for example by computing different features, so that it doesn't matter to the net in what order it received the points? The fact that a point's coordinates are the first (second, third, …) in the input vector conveys no meaning in this task.

What I tried so far:

  • During training, permute each input randomly (and the output correspondingly) and iterate over all 70 possible permutations of the output vector [1,1,1,1,0,0,0,0], so that the permutations are represented equally in the training. This didn't work (the success rate was 1/70, which is equal to chance).
  • Sort the points from left to right (by x-coordinate). Results improved, but the net basically memorized the trajectory and worked equally well even when I removed the eye gaze position. Of course, I want the net to reply correctly even for new trajectories that it wasn't trained on.
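For reference, the count of 70 is the number of ways to choose 4 targets out of 8 points, and the random permutation has to be applied to points and labels together. A minimal sketch using only the standard library; the helper names and the dummy coordinates are hypothetical.

```python
import random
from itertools import combinations

N_POINTS, N_TARGETS = 8, 4

def all_label_vectors(n_points=N_POINTS, n_targets=N_TARGETS):
    """Enumerate every distinct 0/1 label vector with exactly n_targets ones."""
    vectors = []
    for target_idx in combinations(range(n_points), n_targets):
        vectors.append([1 if i in target_idx else 0 for i in range(n_points)])
    return vectors

def permute_frame(points, labels, rng):
    """Shuffle the points and their labels with the same random order."""
    order = list(range(len(points)))
    rng.shuffle(order)
    return [points[i] for i in order], [labels[i] for i in order]

label_vectors = all_label_vectors()

# Dummy frame (assumption): 8 (x, y) points, the first 4 of which are targets.
rng = random.Random(0)
points = [(float(i), float(i) + 0.5) for i in range(N_POINTS)]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
shuffled_points, shuffled_labels = permute_frame(points, labels, rng)
```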

I have an idea for the input features: I could partition the display into a discrete mesh and put 1 where there is a point, some other number where the gaze is, and 0 elsewhere. However, I don't know what the output would look like. Any ideas?
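The mesh encoding described above might look like the following sketch. It assumes coordinates normalized to the range [0, 1], a 16×16 grid, and the value 2 for the gaze cell; all three are illustrative choices, not part of the question.

```python
import numpy as np

def rasterize_frame(points, gaze, grid_size=16, gaze_value=2.0):
    """Encode one frame as a grid: 1 at point cells, gaze_value at the gaze cell, 0 elsewhere."""
    grid = np.zeros((grid_size, grid_size), dtype=np.float32)

    def cell(xy):
        # Map normalized (x, y) to (row, col), clamping 1.0 into the last cell.
        col = min(int(xy[0] * grid_size), grid_size - 1)
        row = min(int(xy[1] * grid_size), grid_size - 1)
        return row, col

    for p in points:
        grid[cell(p)] = 1.0
    # The gaze is written last, so it overwrites a point that shares its cell.
    grid[cell(gaze)] = gaze_value
    return grid

# Dummy frame (assumption): 8 points on a horizontal line, gaze near the top-right corner.
points = [(i / 8 + 1 / 16, 0.5) for i in range(8)]
gaze = (0.9, 0.1)
grid = rasterize_frame(points, gaze)
```

Note that this encoding is order-free by construction, but it also discards point identity, which is why the shape of the output is the open question.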

I know I can't find an answer about a whole trial from one frame, so I'm hoping to combine the outputs of the net for all 80 frames of a trial and find the answer from that.

I'm not even sure there is any hope an NN would manage to learn this. Are there any machine-learning models that are permutation-invariant? I have searched for a long time and found nothing.


Take a look at the PointNet architecture. It solves a similar problem, but in 3D.

The basic approach is as follows. Feed all the points to an embedding layer which maps x and y coordinates into a higher dimensional space. These are the local features of the points. Then, feed all the local features into a "global feature extractor" module whose last layer is a max-pool. The output of this module represents the whole input and the max-pool at the end guarantees permutational invariance (or "symmetry"). Then, concatenate all the local features with the global feature and you get the complete feature set for each point. Finally, map each point's features to the point's class via a dense layer and you are done.

If you take a look at the PointNet source code, you will see that it is quite easy to implement this architecture.
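The steps described above can be sketched in plain NumPy. This is not the PointNet source code: it is a minimal, untrained forward pass with random weights, and the class name, layer sizes, and seed are assumptions made for the illustration. It ignores the gaze position and only shows why the per-point outputs do not depend on the input order.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class TinyPointNet:
    """Untrained PointNet-style per-point classifier (illustrative sizes)."""

    def __init__(self, in_dim=2, local_dim=16, global_dim=32, seed=0):
        rng = np.random.default_rng(seed)
        self.W_local = rng.normal(scale=0.5, size=(in_dim, local_dim))
        self.b_local = np.zeros(local_dim)
        self.W_global = rng.normal(scale=0.5, size=(local_dim, global_dim))
        self.b_global = np.zeros(global_dim)
        self.W_out = rng.normal(scale=0.5, size=(local_dim + global_dim, 1))
        self.b_out = np.zeros(1)

    def forward(self, points):
        # 1. Shared embedding applied to every point: the local features.
        local = relu(points @ self.W_local + self.b_local)
        # 2. Global feature: another shared layer followed by a max-pool over points.
        per_point = relu(local @ self.W_global + self.b_global)
        global_feat = per_point.max(axis=0)
        # 3. Concatenate each point's local features with the global feature.
        tiled = np.broadcast_to(global_feat, (len(points), global_feat.size))
        combined = np.concatenate([local, tiled], axis=1)
        # 4. Dense layer mapping each point's features to a target probability.
        logits = combined @ self.W_out + self.b_out
        return 1.0 / (1.0 + np.exp(-logits[:, 0]))

rng = np.random.default_rng(1)
points = rng.uniform(size=(8, 2))   # 8 points with (x, y) coordinates
perm = rng.permutation(8)

model = TinyPointNet()
probs = model.forward(points)
probs_permuted = model.forward(points[perm])
```

Reordering the input rows reorders the output probabilities in the same way, so no position in the input vector is privileged.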


I am trying to use a neural network for binary and multi-class classification. My dataset has binary, numeric, and nominal variables. The nominal variables in the training set take many distinct values, so when I perform one-hot encoding the dimension grows from 42 to 122. Also, some of the values are present only in the training set, because the dataset was provided that way.

So I used the following order:

  1. One-hot encoding
  2. Normalization
  3. Feature Selection or PCA

However, I found that some people who also used neural networks performed feature selection before one-hot encoding. This seems strange to me, because neural networks only work with numerical data. Running a feature selection algorithm that might delete categorical variables could harm the neural network, especially since one-hot encoding affects the dimensionality of the whole model.

But I don't know, so I have to ask: what is the correct order here? This thread follows the order I used, but I am more interested in the one-hot encoding and feature selection part.


You ask what the correct order is. It may vary depending on your application and data.

For example, as to why feature selection might be used before one-hot encoding: this applies to your nominal data, where, as you state, one-hot encoding grows the dimension from 42 to 122. In such a case it is useful to do feature selection before one-hot encoding.

  • Your concern that running feature selection first might delete categorical variables and thereby harm the neural network is not correct. Keeping uninformative categorical variables and one-hot encoding them can make the network harder to tune (or even prevent convergence) and increases the computational cost.
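As a rough illustration of selecting nominal columns before encoding them, the sketch below scores each raw categorical column by its mutual information with the label and one-hot encodes only the best one. The choice of mutual information is an assumption (the answer does not name a criterion), and the columns `proto` and `noise` are synthetic and hypothetical.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, size=n)

# Synthetic nominal columns (assumption): "proto" follows the label, "noise" does not.
proto = np.where(y == 1, "tcp", "udp")
proto = np.where(rng.random(n) < 0.1, "icmp", proto)
noise = rng.integers(0, 30, size=n).astype(str)
X_nominal = np.column_stack([proto, noise])

# Score the raw columns; integer codes are only a carrier for the discrete values.
codes = OrdinalEncoder().fit_transform(X_nominal)
mi = mutual_info_classif(codes, y, discrete_features=True, random_state=0)
keep = [int(np.argmax(mi))]  # keep the single best column for this toy example

# Compare the encoded width with and without selection.
width_before = OneHotEncoder().fit_transform(X_nominal).shape[1]
width_after = OneHotEncoder().fit_transform(X_nominal[:, keep]).shape[1]
```

Dropping the uninformative high-cardinality column before encoding removes all of its one-hot columns at once, which is the dimensionality saving the answer refers to.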


I am training a neural network using Keras. Every time I train my model, I use a slightly different set of features selected using tree-based feature selection via ExtraTreesClassifier(). After each training run, I compute the AUROC on my validation set and then go back in a loop to train the model again with a different set of features. This process is very inefficient, and I want to select the optimal number of features using an optimization technique available in some Python library.

The function to be optimized is the cross-validated AUROC, which can only be calculated after training the model on the selected features. The features are selected via the following call:

ExtraTreesClassifier(n_estimators=10, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto')

Here the objective function does not depend directly on the parameters to be optimized. The objective, AUROC, comes from training the neural network, and the network takes as input the features selected on the basis of their importance from ExtraTreesClassifier. So in a way, the parameters over which I optimize AUROC are n_estimators=10, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', or some other variables in ExtraTreesClassifier. These are not directly related to AUROC.


You should combine GridSearchCV and Pipeline. Use a Pipeline when you need to run a set of steps in sequence to find the optimal configuration.

For example, suppose you have these steps to run:

  1. Select the k best features (SelectKBest)
  2. Use a classifier, DecisionTree or NaiveBayes

By combining GridSearchCV and Pipeline, you can find which features are best for a particular classifier, the best configuration of that classifier, and so on, based on the scoring criterion.


from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Set your configuration options: one dict per classifier
param_grid = [{
    'classify': [DecisionTreeClassifier()],  # first option: decision tree
    'kbest__k': range(1, 22),  # range of k in SelectKBest(k=...)

    # classifier-specific settings
    'classify__criterion': ('gini', 'entropy'),
    'classify__min_samples_split': range(2, 10),
    'classify__min_samples_leaf': range(1, 10),
}, {
    'classify': [GaussianNB()],  # second option: naive Bayes
    'kbest__k': range(1, 22),  # range of k in SelectKBest(k=...)
}]

# DT is only a placeholder here; GridSearchCV replaces it with each 'classify' option
pipe = Pipeline(steps=[("kbest", SelectKBest()), ("classify", DecisionTreeClassifier())])

# This is where GridSearchCV does the work; it may take a while, especially with more than one classifier to evaluate
grid = GridSearchCV(pipe, param_grid=param_grid, cv=10, scoring='f1')
grid.fit(features, labels)  # features and labels stand for your own training data and target

# Print the best parameters so you can reuse them later without rerunning the grid search
print(grid.best_params_)

# You can now rebuild the pipeline with the best settings to build your model
pipe = Pipeline(steps=[("kbest", SelectKBest(k=12)), ("classify", DecisionTreeClassifier(criterion="entropy", min_samples_leaf=2, min_samples_split=9))])
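Closer to the setup in the question, the same pattern can wrap tree-based selection and score by AUROC. The following is a sketch under several assumptions: synthetic data from `make_classification`, scikit-learn's `MLPClassifier` standing in for the Keras network, and an arbitrary small grid. A Keras model would need a scikit-learn-compatible wrapper to be used as the last pipeline step.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 20 features, 5 of them informative.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

pipe = Pipeline(steps=[
    ("scale", StandardScaler()),
    # threshold=-inf makes max_features the only selection criterion (top-k by importance).
    ("select", SelectFromModel(ExtraTreesClassifier(n_estimators=10, random_state=0),
                               threshold=-np.inf)),
    # Stand-in for the Keras network.
    ("net", MLPClassifier(hidden_layer_sizes=(16,), max_iter=300, random_state=0)),
])

param_grid = {
    "select__max_features": [5, 10, 20],          # how many features to keep
    "select__estimator__n_estimators": [10, 50],  # a selector hyperparameter
}

# Every candidate is refit inside each fold, and cross-validated AUROC is the objective.
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, scoring="roc_auc")
grid.fit(X, y)
best_params = grid.best_params_
best_auc = grid.best_score_
```

Because the selector sits inside the pipeline, its parameters are tuned against the downstream AUROC even though they are not directly related to it.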

Hope this helps


I have a dataset with 2 features and 10000 samples. I would like to convert (integrate) these two features into one feature for further analysis, so I want to use a feature extraction method. As the relationship between the two features is not linear, I want to use a method other than conventional PCA.

Because the number of samples is much larger than the number of features, I think an autoencoder is a good way to do the feature extraction. But there are only 2 input features, so the shape of the autoencoder will be only 2-1-2, which is a linear extraction.

Is it possible to use more hidden nodes than inputs and build a stacked autoencoder, such as one with 2-16-8-1-8-16-2 nodes?

Also, is it a good choice to use an autoencoder for this kind of data integration? If not, are there any better solutions?


Why would this be a linear extraction? If you use any non-linearity in the hidden and output layers, you will get a non-linear relationship between them. Your encoding will in essence be sigmoid(Ax + b).
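A quick numerical check of that claim, as a sketch with made-up weights (`A` and `b` are hypothetical, not learned): an affine map sends the midpoint of two inputs to the midpoint of their outputs, while the sigmoid encoding does not.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical encoder weights for a 2-1-2 autoencoder bottleneck.
A = np.array([[4.0, 1.0]])  # shape (1, 2)
b = np.array([0.0])

def encode(x):
    """One-unit encoding sigmoid(Ax + b) of a 2-D input."""
    return sigmoid(A @ x + b)

x1 = np.array([1.0, 0.0])
x2 = np.array([0.0, 1.0])

# For an affine map these two values would be equal; for the sigmoid encoding they differ.
code_of_midpoint = encode((x1 + x2) / 2.0)
midpoint_of_codes = (encode(x1) + encode(x2)) / 2.0
# Note: the code is still a monotone function of the single projection Ax + b.
```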

If you truly want to make your network more complex, I would suggest using multiple 2-neuron layers before the single-neuron layer, so something like 2 - 2 - 2 - 1 - 2 - 2 - 2 nodes. I do not see any reason why you would need to make it larger.
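One way to try the 2 - 2 - 2 - 1 - 2 - 2 - 2 layout without a deep learning framework is to fit scikit-learn's `MLPRegressor` to reproduce its own input and read the bottleneck activation from the learned weights. This is a sketch under assumptions: the curved synthetic data, the tanh activation, and the helper `encode` are illustrative choices, and reconstruction quality is not guaranteed.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic data (assumption): two features related non-linearly through a hidden variable t.
rng = np.random.default_rng(0)
t = rng.uniform(-1.0, 1.0, size=1000)
X = np.column_stack([t, t ** 2]) + rng.normal(scale=0.01, size=(1000, 2))

# Hidden layers 2-2-1-2-2 between the 2-D input and the 2-D output.
autoencoder = MLPRegressor(hidden_layer_sizes=(2, 2, 1, 2, 2), activation="tanh",
                           max_iter=2000, random_state=0)
autoencoder.fit(X, X)  # the target is the input itself

def encode(model, X, n_encoder_layers=3):
    """Run the first n_encoder_layers weight matrices to get the bottleneck code."""
    h = X
    for W, b in zip(model.coefs_[:n_encoder_layers], model.intercepts_[:n_encoder_layers]):
        h = np.tanh(h @ W + b)
    return h

code = encode(autoencoder, X)          # one extracted feature per sample
reconstruction = autoencoder.predict(X)
```

Changing `hidden_layer_sizes` to `(16, 8, 1, 8, 16)` would give the larger layout from the question, with the same three encoder layers.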