Hot questions: using neural networks with dataframes

Question:

I have a training set that looks like

Name     Day       Area     X        Y      Month  Night
ATTACK   Monday    LA       -122.41  37.78  8      0
VEHICLE  Saturday  CHICAGO  -1.67    3.15   2      0
MOUSE    Monday    TAIPEI   -12.5    3.1    9      1

Name is the outcome/dependent variable. I converted Name, Area and Day into factors, but I wasn't sure if I was supposed to for Month and Night, which only take on integer values 1-12 and 0-1, respectively.

I then convert the data into matrices:

ynn <- model.matrix(~Name , data = trainDF)
mnn <- model.matrix(~ Day+Area +X + Y + Month + Night, data = trainDF)

I then set up the tuning parameters:

nnTrControl = trainControl(method = "repeatedcv", number = 3, repeats = 5,
                           verboseIter = TRUE, returnData = FALSE, returnResamp = "all",
                           classProbs = TRUE, summaryFunction = multiClassSummary,
                           allowParallel = TRUE)
nnGrid = expand.grid(.size = c(1, 4, 7), .decay = c(0, 0.001, 0.1))
model <- train(y = ynn, x = mnn, method = 'nnet', linout = TRUE, trace = FALSE,
               trControl = nnTrControl, metric = "logLoss", tuneGrid = nnGrid)

However, the model <- train(...) call fails with: Error: nrow(x) == n is not TRUE

I also get a similar error if I use xgboost instead of nnet.

Anyone know what's causing this?


Answer:

y should be a numeric or factor vector containing the outcome for each sample, not a matrix. Using

train(y = make.names(trainDF$Name), ...)

helps; make.names modifies the values so that they become syntactically valid R variable names.

Question:

I'm trying to run a kNN classifier across my dataset using 10-fold CV. I have some experience with models in WEKA, but I'm struggling to transfer this over to sklearn.

Below is my code

filename = 'train4.csv'
names = ['attribute names are here']

df = pandas.read_csv(filename, names=names)

num_folds = 10
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=7)  # recent scikit-learn requires shuffle=True when random_state is set
model = KNeighborsClassifier()
results = cross_val_score(model, df.drop('mix1_instrument', axis=1), df['mix1_instrument'], cv=kfold)
print(results.mean())

I am receiving this error

 ValueError: could not convert string to float: ''

How can I convert this attribute? It contains useful information for classifying my instances, so would converting it affect classification?

There are two attributes of dtype 'object' that I believe need converting, named 'class1' and 'class2'.

Sample data below...

{
    'temporalCentroid': {
        0: 'temporalCentroid',
        1: '1.67324',
        2: '1.330722',
        3: '0.786984',
        4: '1.850129'
    },
    'LogSpecCentroid': {
        0: 'LogSpecCentroid',
        1: '-1.043802',
        2: '-0.82943',
        3: '-2.441297',
        4: '-0.837145'
    },
    'LogSpecSpread': {
        0: 'LogSpecSpread',
        1: '0.747558',
        2: '1.378373',
        3: '0.667634',
        4: '1.238404'
    },
    'MFCC1': {
        0: 'MFCC1',
        1: '3.502117',
        2: '6.697601',
        3: '4.011488',
        4: '0.823614'
    },
    'MFCC2': {
        0: 'MFCC2',
        1: '-9.208897',
        2: '-9.741549',
        3: '15.27665',
        4: '-15.22256'
    },
    'MFCC3': {
        0: 'MFCC3',
        1: '-2.334097',
        2: '-9.868089',
        3: '0.802509',
        4: '-4.978688'
    },
    'MFCC4': {
        0: 'MFCC4',
        1: '-9.013086',
        2: '0.609091',
        3: '2.50685',
        4: '-2.489553'
    },
    'MFCC5': {
        0: 'MFCC5',
        1: '4.847481',
        2: '1.733307',
        3: '0.10459',
        4: '1.066615'
    },
    'MFCC6': {
        0: 'MFCC6',
        1: '-4.770421',
        2: '-5.381835',
        3: '-0.260118',
        4: '-1.020861'
    },
    'MFCC7': {
        0: 'MFCC7',
        1: '-3.362488',
        2: '-1.261088',
        3: '0.593255',
        4: '-2.007349'
    },
    'MFCC8': {
        0: 'MFCC8',
        1: '-9.527529',
        2: '-3.809237',
        3: '-0.362287',
        4: '-8.938164'
    },
    'MFCC9': {
        0: 'MFCC9',
        1: '-9.629579',
        2: '1.486923',
        3: '-2.957592',
        4: '-2.324424'
    },
    'MFCC10': {
        0: 'MFCC10',
        1: '1.848685',
        2: '-3.938455',
        3: '-1.884439',
        4: '-2.535579'
    },
    'MFCC11': {
        0: 'MFCC11',
        1: '-2.311295',
        2: '-2.159865',
        3: '-0.827179',
        4: '0.638553'
    },
    'MFCC12': {
        0: 'MFCC12',
        1: '-7.696675',
        2: '-3.138412',
        3: '-0.605056',
        4: '-1.116259'
    },
    'MFCC13': {
        0: 'MFCC13',
        1: '10.35572',
        2: '9.095669',
        3: '6.426399',
        4: '15.04535'
    },
    'MFCCMin': {
        0: 'MFCCMin',
        1: '-9.629579',
        2: '-9.868089',
        3: '-2.957592',
        4: '-15.22256'
    },
    'MFCCMax': {
        0: 'MFCCMax',
        1: '10.35572',
        2: '9.095669',
        3: '15.27665',
        4: '15.04535'
    },
    'MFCCSum': {
        0: 'MFCCSum',
        1: '-37.300064',
        2: '-19.675939',
        3: '22.82507',
        4: '-23.059305'
    },
    'MFCCAvg': {
        0: 'MFCCAvg',
        1: '-2.869235692',
        2: '-1.513533769',
        3: '1.755774615',
        4: '-1.773792692'
    },
    'MFCCStd': {
        0: 'MFCCStd',
        1: '6.409842944',
        2: '5.558499123',
        3: '4.756836281',
        4: '6.76039911'
    },
    'Energy': {
        0: 'Energy',
        1: '-2.96148',
        2: '-3.522993',
        3: '-3.409359',
        4: '-2.235853'
    },
    'ZeroCrossings': {
        0: 'ZeroCrossings',
        1: '128',
        2: '188',
        3: '43',
        4: '288'
    },
    'SpecCentroid': {
        0: 'SpecCentroid',
        1: '284.0513',
        2: '414.8489',
        3: '102.2096',
        4: '405.1262'
    },
    'SpecSpread': {
        0: 'SpecSpread',
        1: '207.5526',
        2: '350.7937',
        3: '53.52178',
        4: '360.0353'
    },
    'Rolloff': {
        0: 'Rolloff',
        1: '263.7817',
        2: '783.2703',
        3: '129.1992',
        4: '912.4695'
    },
    'Flux': {
        0: 'Flux',
        1: '0',
        2: '0',
        3: '0',
        4: '0'
    },
    'bandsCoefMin': {
        0: 'bandsCoefMin',
        1: '-0.224957',
        2: '-0.247903',
        3: '-0.22283',
        4: '-0.232534'
    },
    'bandsCoefMax': {
        0: 'bandsCoefMax',
        1: '-0.074945',
        2: '-0.113654',
        3: '-0.062254',
        4: '-0.080883'
    },
    'bandsCoefSum1': {
        0: 'bandsCoefSum1',
        1: '-5.575428',
        2: '-5.524777',
        3: '-5.511125',
        4: '-5.532536'
    },
    'bandsCoefAvg': {
        0: 'bandsCoefAvg',
        1: '-0.168952364',
        2: '-0.167417485',
        3: '-0.167003788',
        4: '-0.167652606'
    },
    'bandsCoefStd': {
        0: 'bandsCoefStd',
        1: '0.042580181',
        2: '0.048429973',
        3: '0.049881374',
        4: '0.0475839'
    },
    'bandsCoefSum': {
        0: 'bandsCoefSum',
        1: '382.5963',
        2: '360.9232',
        3: '384.3541',
        4: '368.9903'
    },
    'prjmin': {
        0: 'prjmin',
        1: '-0.999362',
        2: '-0.999719',
        3: '-0.988315',
        4: '-0.999421'
    },
    'prjmax': {
        0: 'prjmax',
        1: '0.023797',
        2: '0.009596',
        3: '0.028112',
        4: '0.024612'
    },
    'prjSum': {
        0: 'prjSum',
        1: '-0.99911',
        2: '-1.006792',
        3: '-1.084054',
        4: '-1.002478'
    },
    'prjAvg': {
        0: 'prjAvg',
        1: '-0.030276061',
        2: '-0.030508848',
        3: '-0.032850121',
        4: '-0.030378121'
    },
    'prjStd': {
        0: 'prjStd',
        1: '0.174082468',
        2: '0.174040569',
        3: '0.173600498',
        4: '0.174064118'
    },
    'LogAttackTime': {
        0: 'LogAttackTime',
        1: '0.365883',
        2: '-0.35427',
        3: '-0.669283',
        4: '-0.026181'
    },
    'HamoPkMin': {
        0: 'HamoPkMin',
        1: '0',
        2: '0',
        3: '0',
        4: '0'
    },
    'HamoPkMax': {
        0: 'HamoPkMax',
        1: '1.025473',
        2: '1.05761',
        3: '0.986766',
        4: '0.957316'
    },
    'HamoPkSum': {
        0: 'HamoPkSum',
        1: '14.391206',
        2: '20.306125',
        3: '9.727358',
        4: '14.772449'
    },
    'HamoPkAvg': {
        0: 'HamoPkAvg',
        1: '0.513971643',
        2: '0.72521875',
        3: '0.347405643',
        4: '0.527587464'
    },
    'HamoPkStd': {
        0: 'HamoPkStd',
        1: '0.376622124',
        2: '0.325929503',
        3: '0.388971641',
        4: '0.381693476'
    },
    'class1': {
        0: 'class1',
        1: 'aerophone',
        2: 'aerophone',
        3: 'chordophone',
        4: 'aerophone'
    },
    'class2': {
        0: 'class2',
        1: 'aero_single-reed',
        2: 'aero_lip-vibrated',
        3: 'chrd_simple',
        4: 'aero_single-reed'
    },
    'mix1_instrument': {
        0: 'mix1_instrument',
        1: 'Saxophone',
        2: 'Trumpet',
        3: 'Piano',
        4: 'Clarinet'
    }
}

Thanks


Answer:

Here is a small demo:

Source DF:

In [43]: df
Out[43]:
     Energy  HamoPkStd       class1             class2 mix1_instrument
0 -2.961480  14.391206    aerophone   aero_single-reed       Saxophone
1 -3.522993  20.306125  chordophone  aero_lip-vibrated         Trumpet
2 -3.409359   9.727358    aerophone        chrd_simple           Piano

Label encoding:

In [44]: %paste
from sklearn.preprocessing import LabelBinarizer, LabelEncoder

str_cols = df.columns[df.columns.str.contains('(?:class|instrument)')]
clfs = {c:LabelEncoder() for c in str_cols}

for col, clf in clfs.items():
    df[col] = clf.fit_transform(df[col])
## -- End pasted text --

Result: all text/string columns have been converted to numbers, so we can feed the data to a neural network:

In [45]: df
Out[45]:
     Energy  HamoPkStd  class1  class2  mix1_instrument
0 -2.961480  14.391206       0       1                1
1 -3.522993  20.306125       1       0                2
2 -3.409359   9.727358       0       2                0

Inverse transformation:

In [48]: clfs['class1'].inverse_transform(df['class1'])
Out[48]: array(['aerophone', 'chordophone', 'aerophone'], dtype=object)

In [49]: clfs['mix1_instrument'].inverse_transform(df['mix1_instrument'])
Out[49]: array(['Saxophone', 'Trumpet', 'Piano'], dtype=object)
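One more thing worth noting: in the sample data, row 0 of every column repeats the column name itself (e.g. 'temporalCentroid'), which is what happens when the CSV already contains a header row but names= is also passed to read_csv, so the header line is read in as a data row. That stray row may well be the source of the string-conversion error. A minimal sketch of dropping it and restoring numeric dtypes (the toy frame below is a hypothetical stand-in for the real file):

```python
import pandas as pd

# hypothetical stand-in for the real CSV: the first data row repeats the headers
df = pd.DataFrame({
    'Energy': ['Energy', '-2.96148', '-3.522993'],
    'class1': ['class1', 'aerophone', 'chordophone'],
})

df = df.iloc[1:].reset_index(drop=True)      # drop the stray header row
df['Energy'] = pd.to_numeric(df['Energy'])   # restore numeric dtype
print(df.dtypes['Energy'])                   # float64
```

Alternatively, reading the file with header=0 and no names= argument avoids the problem entirely.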

Question:

I want to train a neural network (multilayer perceptron) with the following data:

1              2              3             Other Field   Label
[1, 2, 3, 4]   [5, 6, 7, 8]   [9, 10, 11]   1234          5678
etc...

Here 1, 2 and 3 are columns that contain a list. The other two columns just have numeric values.

However, I keep getting this error:

ValueError: setting an array element with a sequence.

Is this even possible?

Edit: My code to train the neural network is as follows:

from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(alpha=1e-5, hidden_layer_sizes=(10, 10), random_state=1)
mlp.fit(X_train, y_train)

Here's a screenshot of my train data:

And my label is just one column with numbers.


Answer:

If your lists always have the same length, it's just a matter of splitting each list-column into individual columns, as described e.g. here:

import pandas as pd

# create a dataset
raw_data = {'score': [1, 2, 3],
            'tags': [['apple','pear','guava'], ['truck','car','plane'], ['cat','dog','mouse']]}
df = pd.DataFrame(raw_data, columns=['score', 'tags'])
# expand df.tags into its own dataframe
tags = df['tags'].apply(pd.Series)
# rename each variable
tags = tags.rename(columns = lambda x : 'tag_' + str(x))
# join the tags dataframe back to the original dataframe
df = pd.concat([df[:], tags[:]], axis=1)
df.drop('tags', inplace=True, axis=1)

If not, the best answer might be problem-specific. One approach could be to extend each list to the length of the longest list by padding with filler values and then doing the same.
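The padding idea in the last paragraph can be sketched like this (the column names here are made up, and 0 is an assumed filler value; choose a filler that makes sense for your features):

```python
import pandas as pd

# made-up columns: one scalar feature plus one variable-length list column
raw = {'score': [1, 2], 'vals': [[1, 2], [3, 4, 5]]}
df = pd.DataFrame(raw)

max_len = df['vals'].str.len().max()                              # length of the longest list
padded = df['vals'].apply(lambda v: v + [0] * (max_len - len(v))) # pad shorter lists with 0
expanded = padded.apply(pd.Series).rename(columns=lambda i: 'val_' + str(i))
df = pd.concat([df.drop('vals', axis=1), expanded], axis=1)
print(list(df.columns))  # ['score', 'val_0', 'val_1', 'val_2']
```

The resulting all-numeric frame can then be passed to MLPClassifier without the sequence error.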

Question:

I'm trying to create the most basic neural network from scratch to predict the stock price of Apple. The following code is what I have gotten to so far, with assistance from data-science tutorials. However, I'm now at the point of actually feeding in the data and making sure it does so correctly. I would like to feed in a pandas data frame of stock trades. This is my view of the NN:

  • 5 Input nodes (Open,Close,High,Low,Volume) *note - this will be in a pandas data frame with a datetime index
  • AF that sums the weights of each input.
  • Sigmoid function to normalise the values
  • 1 output (adj close) *Not sure what I should use as the actual value

Then the process is to move back using the back-propagation technique.

import pandas as pd
import pandas_datareader as web
import matplotlib.pyplot as plt
import numpy as np

def sigmoid(x):
    return 1.0/(1+ np.exp(-x))

def sigmoid_derivative(x):
    return x * (1.0 - x)

class NeuralNetwork:
    def __init__(self, x, y):
        self.input      = x
        self.weights1   = #will work out when i get the correct input
        self.weights2   = #will work out when i get the correct input                
        self.y          = y
        self.output     = #will work out 

    def feedforward(self):
        self.layer1 = sigmoid(np.dot(self.input, self.weights1))
        self.output = sigmoid(np.dot(self.layer1, self.weights2))

    def backprop(self):
        # application of the chain rule to find derivative of the loss function with respect to weights2 and weights1
        d_weights2 = np.dot(self.layer1.T, (2*(self.y - self.output) * sigmoid_derivative(self.output)))
        d_weights1 = np.dot(self.input.T,  (np.dot(2*(self.y - self.output) * sigmoid_derivative(self.output), self.weights2.T) * sigmoid_derivative(self.layer1)))

        # update the weights with the derivative (slope) of the loss function
        self.weights1 += d_weights1
        self.weights2 += d_weights2


if __name__ == "__main__":
    X = #need help here
    y = #need help here
    nn = NeuralNetwork(X,y)

    for i in range(1500):
        nn.feedforward()
        nn.backprop()

    print(nn.output)

If you have any suggestions, corrections, or anything else, please let me know, because I am thoroughly invested in learning neural networks.

Thanks.


Answer:

Don't feed the pandas objects into the network directly; the matrix math expects plain arrays, and the per-element overhead of pandas would hurt performance badly. Instead, pass in the underlying NumPy arrays:

X = df[['Open','Close','High','Low','Volume']].values

y = df['adj close'].values

Does that answer the question?
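To fill in the weight placeholders in the question's __init__, here is a minimal sketch; the 4-unit hidden layer is an assumption matching the two-matrix shape of feedforward above, not something from the original post. Note also that a sigmoid output lies in (0, 1), so raw adjusted-close prices would need to be scaled before being used as y:

```python
import numpy as np

class NeuralNetwork:
    def __init__(self, x, y):
        self.input = x                                           # (n_samples, 5) feature array
        self.weights1 = np.random.rand(self.input.shape[1], 4)   # input -> hidden (4 units assumed)
        self.weights2 = np.random.rand(4, 1)                     # hidden -> output
        self.y = y.reshape(-1, 1)                                # targets as a column vector
        self.output = np.zeros(self.y.shape)

# quick shape check with dummy data standing in for the OHLCV array
nn = NeuralNetwork(np.random.rand(3, 5), np.random.rand(3))
```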

Question:

I'm starting with Neural Networks and I'm having some issues with my data format. I have a pandas DataFrame with 130 rows, 4 columns and each data point is an array of 595 items.

      |      Col 1      |    Col 2        |    Col 3        |    Col 4        |
Row 1 | [x1, ..., x595] | [x1, ..., x595] | [x1, ..., x595] | [x1, ..., x595] |
Row 2 | [x1, ..., x595] | [x1, ..., x595] | [x1, ..., x595] | [x1, ..., x595] |
Row 3 | [x1, ..., x595] | [x1, ..., x595] | [x1, ..., x595] | [x1, ..., x595] |

I created the X_train, X_test, y_train and y_test sets using train_test_split. However, when I check the shape of X_train it returns (52, 4), and I'm not able to create a model for my NN because it doesn't accept this shape. This is the error:

"ValueError: Error when checking input: expected dense_4_input to have 3 dimensions, but got array with shape (52, 4)"

I believe it's because it should be (52, 4, 595), right? So I'm kind of confused: how can I specify the input shape correctly, or reshape my data into the appropriate format?

I am using pandas, keras, tensorflow and jupyter notebook.


Answer:

You have to reshape your data to a 3d numpy array.

Suppose we have a data frame where each cell is a numpy array as you described

import pandas as pd
import numpy as np

data = pd.DataFrame(np.zeros((130, 4))).astype('object')
for i in range(130):
    for k in range(4):
        data.iloc[i, k] = np.zeros(595)

we can then convert the data frame to a 3d numpy array. (Stacking the columns first and then calling reshape would scramble the sample order, because reshape does not transpose; instead, stack row by row so each sample keeps its own 4 channels together.)

dataar = np.stack([np.vstack(row) for row in data.values])  # each row's 4 cells -> one (4, 595) block
dataar.shape
# (130, 4, 595)

Question:

I have time-series dataframes that I want to use in conjunction with a convolutional neural network for pattern/anomaly detection.

I'm just wondering how I can transform them without losing essential data.


Answer:

Managed to form a tensor containing 3D arrays for analysis in Convolutional Neural Networks from a simple Data Frame using a moving window:

def windows(data, size):
    start = 0
    while start + size <= len(data):
        yield start, start + size
        start += 1

def segmentor(data, window_size, num_channels):
    segments = np.empty((0, window_size, num_channels))  # (n_windows, window, channels)
    for (start, end) in windows(data, window_size):
        placeholder = data.iloc[start:end, :]  # slice the dataframe to extract one time window
        if len(placeholder) == window_size:  # forgo any leftover partial window
            # stack the columns depthwise into a (1, window_size, num_channels) block
            pl_ = np.dstack([placeholder[c].values for c in placeholder.columns])
            segments = np.vstack([segments, pl_])
    return segments

The resulting structures can then be passed to generic CNNs.
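On NumPy 1.20+, a similar window tensor can be built more directly with sliding_window_view; a sketch on a hypothetical toy numeric frame:

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

# toy numeric frame: 10 time steps, 2 channels
df = pd.DataFrame(np.arange(20).reshape(10, 2), columns=['a', 'b'])
window_size = 3

# windows over the time axis; transpose to (n_windows, window_size, n_channels)
segments = sliding_window_view(df.values, window_size, axis=0).transpose(0, 2, 1)
print(segments.shape)  # (8, 3, 2)
```

This avoids the O(n^2) cost of repeatedly vstack-ing into a growing array.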

Question:

I think this is a simple question, but not for me. There is a table in df:

Date        X1  X2  Y1
07.02.2019  5   1   1
08.02.2019  6   2   1
09.02.2019  1   3   0
10.02.2019  4   4   1
11.02.2019  1   1   0
12.02.2019  4   2   1
13.02.2019  5   5   1
14.02.2019  6   5   1
15.02.2019  1   1   0
16.02.2019  4   5   1
17.02.2019  1   2   0
18.02.2019  1   1   
19.02.2019  2   1   
20.02.2019  3   2   
21.02.2019  4   14

I need to build a neural network that predicts Y1 from the parameters X1 and X2, then apply it to the rows with a date greater than 17.02.2019, and save the network's predictions in a separate df2.

import pandas as pd
import numpy as np
import re
from sklearn.neural_network import MLPClassifier

df = pd.read_csv("ob.csv", encoding = 'cp1251', sep = ';')
df['Date'] = pd.to_datetime(df['Date'], format='%d.%m.%Y')
startdate = pd.to_datetime('2019-02-17')

X = ['X1', 'X2'] ????
y = ['Y1'] ????
clf = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1)
clf.fit(x, y)
clf.predict(???????)  ????? df2 = ????

Where there are ???? marks, I do not know how to set the conditions correctly.


Answer:

import pandas as pd
import numpy as np
import re
from sklearn.neural_network import MLPClassifier 

df = pd.read_csv("ob.csv", encoding = 'cp1251', sep = ';')
df['Date'] = pd.to_datetime(df['Date'], format='%d.%m.%Y')
startdate = pd.to_datetime('2019-02-17') 

train = df[df['Date'] <= startdate]
test = df[df['Date'] > startdate]

X_train = train[['X1', 'X2']]
y_train = train['Y1']  # a Series, so the classifier gets a 1d target

X_test = test[['X1', 'X2']]

clf = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1)
clf.fit(X_train, y_train)
df2 = pd.DataFrame(clf.predict(X_test), index=test.index, columns=['Y1_pred'])
df2.to_csv('prediction.csv')