Hot questions on neural networks and random forests

Question:

I want to run some experiments with neural networks using PyTorch, so I tried a simple one as a warm-up exercise, and I cannot quite make sense of the results.

The exercise attempts to predict the rating of 1000 TPTP problems from various statistics about the problems, such as number of variables, maximum clause length, etc. The data file https://github.com/russellw/ml/blob/master/test.csv is quite straightforward: 1000 rows, with the final column being the rating. I started off with some tens of input columns, all the numbers scaled to the range 0-1. I progressively deleted features to see if the result still held, and it does, all the way down to one input column; the others are in previous versions in the Git history.

I started off using separate training and test sets, but have set aside the test set for the moment, because the question of whether training performance generalizes to testing doesn't arise until training performance has been obtained in the first place.

Simple linear regression on this data set has a mean squared error of about 0.14.
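
For concreteness, the linear baseline is just the obvious least-squares fit; a scikit-learn sketch:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# same data file as below: the last column is the rating
df = pd.read_csv("test.csv")
X, y = df.iloc[:, :-1], df.iloc[:, -1]

model = LinearRegression().fit(X, y)
print("mean squared error:", mean_squared_error(y, model.predict(X)))  # about 0.14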

I implemented a simple feedforward neural network (code in https://github.com/russellw/ml/blob/master/test_nn.py and copied below) that, after a couple hundred training epochs, also has a mean squared error of 0.14.

So I tried changing the number of hidden layers from 1 to 2 to 3, using a few different optimizers, tweaking the learning rate, switching the activation functions from relu to tanh to a mixture of both, increasing the number of epochs to 5000, and increasing the number of hidden units to 1000. At that point, the network should easily have had the capacity to memorize the entire data set. (I'm not concerned about overfitting at this stage; I'm just trying to get the mean squared error on training data to be something other than 0.14.) Nothing made any difference. Still 0.14. I would say it must be stuck in a local optimum, but that's not supposed to happen when you've got a couple million weights; it's supposed to be practically impossible to be at a local optimum for all parameters simultaneously. And I do get slightly different sequences of numbers on each run. But it always converges to 0.14.

Now the obvious conclusion would be that 0.14 is as good as it gets for this problem, except that the error stays the same even when the network has enough capacity to simply memorize all the data. But the clincher is that I also tried a random forest, https://github.com/russellw/ml/blob/master/test_rf.py

... and the random forest has a mean squared error of 0.01 on the original data set, degrading gracefully as features are deleted; it still scores 0.05 on the data with just one feature.
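
Roughly, that script amounts to something like the following (the file in the repo has the exact details):

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

df = pd.read_csv("test.csv")
X, y = df.iloc[:, :-1], df.iloc[:, -1]

model = RandomForestRegressor(n_estimators=100).fit(X, y)
print("mean squared error:", mean_squared_error(y, model.predict(X)))  # about 0.01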

Nowhere in the lore of machine learning is it said that 'random forests vastly outperform neural nets', so I'm presumably doing something wrong, but I can't see what it is. Maybe it's something as simple as a missing flag or setting in PyTorch. I would appreciate it if someone could take a look.

import numpy as np
import pandas as pd
import torch
import torch.nn as nn

# data
df = pd.read_csv("test.csv")
print(df)
print()

# separate the output column
y_name = df.columns[-1]
y_df = df[y_name]
X_df = df.drop(y_name, axis=1)

# numpy arrays
X_ar = np.array(X_df, dtype=np.float32)
y_ar = np.array(y_df, dtype=np.float32)

# torch tensors
X_tensor = torch.from_numpy(X_ar)
y_tensor = torch.from_numpy(y_ar)

# hyperparameters
in_features = X_ar.shape[1]
hidden_size = 100
out_features = 1
epochs = 500

# model
class Net(nn.Module):
    def __init__(self, hidden_size):
        super(Net, self).__init__()
        self.L0 = nn.Linear(in_features, hidden_size)
        self.N0 = nn.ReLU()
        self.L1 = nn.Linear(hidden_size, hidden_size)
        self.N1 = nn.Tanh()
        self.L2 = nn.Linear(hidden_size, hidden_size)
        self.N2 = nn.ReLU()
        self.L3 = nn.Linear(hidden_size, 1)

    def forward(self, x):
        x = self.L0(x)
        x = self.N0(x)
        x = self.L1(x)
        x = self.N1(x)
        x = self.L2(x)
        x = self.N2(x)
        x = self.L3(x)
        return x


model = Net(hidden_size)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)

# train
print("training")
for epoch in range(1, epochs + 1):
    # forward
    output = model(X_tensor)
    cost = criterion(output, y_tensor)

    # backward
    optimizer.zero_grad()
    cost.backward()
    optimizer.step()

    # print progress
    if epoch % (epochs // 10) == 0:
        print(f"{epoch:6d} {cost.item():10f}")
print()

output = model(X_tensor)
cost = criterion(output, y_tensor)
print("mean squared error:", cost.item())

Answer:

Can you please print the shape of your input? I would check these things first:

  • Check that your target y has the shape (-1, 1); I don't know whether PyTorch throws an error in this case. You can use y.reshape(-1, 1) if it isn't 2-dimensional (see the sketch after this list).
  • Your learning rate is high. When using Adam, the default value is usually good enough, or simply try lowering it; 0.1 is a high learning rate to start with.
  • Place optimizer.zero_grad() as the first line inside the for loop.
  • Normalize/standardize your data (this is usually good for NNs).
  • Remove outliers from your data (my opinion: I think this can't affect a random forest much, but it can affect NNs badly).
  • Use cross-validation (maybe skorch can help you here; it's a scikit-learn wrapper for PyTorch and easy to use if you know Keras).
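
As a minimal sketch of the first two points, applied to the script in the question (this reuses the names from that code; lr=1e-3 is just Adam's default, not a tuned value):

# make the target 2-dimensional so it matches the model output (N, 1);
# otherwise nn.MSELoss broadcasts the (N, 1) predictions against the (N,)
# target and averages over an (N, N) matrix, masking the real training error
y_tensor = torch.from_numpy(y_ar).reshape(-1, 1)

# start from Adam's default learning rate rather than 0.1
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)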

Note that a random forest regressor, or any other regressor, can outperform neural nets in some cases. There are fields where neural nets are the heroes, like image classification or NLP, but be aware that a simple regression algorithm can outperform them, usually when your data is not big enough.

Question:

I'm analyzing a medical dataset containing 15 variables and 1.5 million data points. I would like to predict hospitalization and, more importantly, which type of medication may be responsible. The medicine variable has around 700 types of drugs.

Does anyone know how to calculate the importance of a "value" (type of drug in this case) within a variable for boosting? I need to know whether 'drug A' is better for prediction than 'drug B', both within a variable called 'medicine'. A logistic regression model can give such information in terms of p-values for each drug, but I would like to use a more complex method. Of course you can create a binary variable for each type of drug, but this gives 700 extra variables and does not seem to work very well.

I'm currently using R. I really hope you can help me solve this problem. Thanks in advance!

Kind regards, Peter


Answer:

See varImp() in the caret library, which supports all of the ML algorithms you referenced.
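
If a Python equivalent ever helps, here is a rough scikit-learn sketch of the same idea (the data frame and column names below are made up, and this is the dummy-variable route the question mentions, shown only to make the mechanics concrete): one-hot encode the drug column so each drug becomes its own feature, then read per-drug importances off the boosted model.

import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# made-up miniature of the medical data: 'medicine' is the categorical
# drug column, 'hospitalized' is the binary outcome
df = pd.DataFrame({
    "medicine": ["drugA", "drugB", "drugA", "drugC", "drugB", "drugC"],
    "age": [34, 71, 52, 60, 45, 80],
    "hospitalized": [0, 1, 0, 1, 1, 1],
})

# one dummy column per drug, so importance is reported per drug
X = pd.get_dummies(df.drop("hospitalized", axis=1), columns=["medicine"])
y = df["hospitalized"]

model = GradientBoostingClassifier().fit(X, y)

# per-feature (and therefore per-drug) importances
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))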

Question:

I'm currently making a machine learning model for a student project, and I'm still deciding what model I should use. Here's the brief I was given:

The Global Terrorism Database (GTD) is an open-source database including information on terrorist events around the world from 1970 through 2014. Some portion of the attacks has not been attributed to a particular terrorist group. Use attack type, weapons used, description of the attack, etc., to build a model that can predict what group may have been responsible for an incident.

The data frame has:

  • 134 columns, about 100,000 rows
  • many of the columns have missing values
  • I've only been given 5 days to submit my final work, so I can't spend a prolonged period training the model

I'm leaning towards using a backpropagation neural network, as I believe it can handle the missing values, though a random forest might also be viable given the limited time I have to train it. I've done a lot of research on the various pros and cons of common ML models, but any additional advice would be greatly appreciated.


Answer:

It would be easier to answer this question if you had tried several candidate methods and described why they don't suffice, but here's one place to start. If you didn't have access to a computer and someone gave you this table and asked you to qualitatively describe how terrorism works, you might notice very quickly that, say, the Irish Republican Army doesn't operate in Afghanistan, and that only ISIS is involved in attacks that kill more than 1000 people (let's stipulate). These observations are akin to how a random forest operates on categorical and continuous data, respectively.

The point is that your brain gravitates towards something like a random forest when it tries to qualitatively describe the fundamental reality behind data like this. (Multiple splits would look like: well, there was no terrorism in America before 1991, and after 1991 most terrorist attacks in America have involved groups X, Y, and Z, and so forth.) A corollary is that you will have a lot to say about what your trained random forest is telling you, where it fails, and why it fails where it does.

If you use a neural network without knowing a lot about the details of how it works, you might end up mindlessly tuning things until something seems to work, with no idea what to say about how well it performs in various situations or which features are informative.

So why not use a random forest, find out where it does and does not work, contemplate the result, and iterate on that?
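
Concretely, that loop might look something like this (a sketch in Python with scikit-learn; the column names are invented stand-ins for GTD fields, and dealing with the real data's 134 columns and missing values is the actual work):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# invented miniature of the GTD: mixed categorical and numeric features,
# with the responsible group as the label
df = pd.DataFrame({
    "country": ["Afghanistan", "Ireland", "Iraq", "Ireland", "Iraq"],
    "attack_type": ["bombing", "shooting", "bombing", "bombing", "shooting"],
    "n_killed": [12, 1, 1500, 0, 3],
    "group": ["Taliban", "IRA", "ISIS", "IRA", "ISIS"],
})

# random forests need numeric input, so one-hot encode the categoricals
X = pd.get_dummies(df.drop("group", axis=1))
y = df["group"]

model = RandomForestClassifier(n_estimators=100).fit(X, y)

# the 'contemplate the result' step: which features drive the splits?
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))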