100% Accuracy using SVC classification, something must be wrong?

Context to what I'm trying to achieve:

I have a problem with image classification using scikit-learn. I have CIFAR-10 data, split into training and testing images: 10000 training images and 1000 testing images. Each set is stored in its own npy file as a 4-D array (height, width, rgb, sample), and I also have the matching test/train labels. I have a 'computeFeatures' method that uses the Histogram of Oriented Gradients (HOG) method to represent each image as a feature vector. I am trying to apply this method to both the training and testing data so that I can build an array of features for classifying the images later. I have tried doing this with a for loop over the index i, storing the results in a numpy array. I must then apply PCA/LDA and classify the images with SVC, a CNN, etc. (any method of image classification).

import numpy as np
import skimage.feature
from sklearn.decomposition import PCA
trnImages = np.load('trnImage.npy')
tstImages = np.load('tstImage.npy')
trnLabels = np.load('trnLabel.npy')
tstLabels = np.load('tstLabel.npy')
from sklearn.svm import SVC

def computeFeatures(image):
    hog_feature, hog_as_image = skimage.feature.hog(image, visualize=True, block_norm='L2-Hys')
    return hog_feature

trnArray = np.zeros([10000,324]) 
tstArray = np.zeros([1000,324])

for i in range (0, 10000 ):
    trnFeatures = computeFeatures(trnImages[:,:,:,i])
    trnArray[i,:] = trnFeatures


for i in range (0, 1000):
    tstFeatures = computeFeatures(tstImages[:,:,:,i])
    tstArray[i,:] = tstFeatures


pca = PCA(n_components = 2)
trnModel = pca.fit_transform(trnArray)
pca = PCA(n_components = 2)
tstModel = pca.fit_transform(tstArray)

# Divide the dataset into the two sets.
test_data = tstModel
test_labels = tstLabels 
train_data = trnModel
train_labels = trnLabels 

C = 1 
model = SVC(kernel='linear', C=C)

model.fit(train_data, train_labels.ravel())

y_pred = model.predict(test_data)

accuracy = np.sum(np.equal(test_labels, y_pred)) / test_labels.shape[0] 
print('Percentage accuracy on testing set is: {0:.2f}%'.format(accuracy))

The accuracy prints out as 100%. I'm pretty sure this is wrong, but I'm not sure why.

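First of all,

pca = PCA(n_components = 2)
tstModel = pca.fit_transform(tstArray)

this is wrong: it fits a second, unrelated PCA on the test data, so the test features end up in a different projection from the one the SVC was trained on. Reuse the PCA that was fitted on the training data:

tstModel = pca.transform(tstArray)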
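
As a minimal sketch of fitting the projection once and reusing it, the snippet below uses random arrays as stand-ins for the HOG features (the array sizes here are assumptions chosen only to keep the example small):

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-ins for the HOG feature arrays (324 features per image, as in the question).
rng = np.random.default_rng(0)
trnArray = rng.random((200, 324))
tstArray = rng.random((50, 324))

pca = PCA(n_components=2)
trnModel = pca.fit_transform(trnArray)  # estimate the projection on the training data only
tstModel = pca.transform(tstArray)      # apply that same projection to the test data

print(trnModel.shape, tstModel.shape)
```

Both sets now live in the same two-dimensional space, so a classifier trained on trnModel can meaningfully be evaluated on tstModel.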

Secondly, how did you select the number of PCA components? Why 2 and not 25 or 100? Two principal components may be too few for images. Also, as far as I can tell, the features are not scaled prior to PCA.
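
One possible way to pick the number of components, sketched here on random stand-in data, is to standardise the features and then ask PCA to keep enough components to explain a chosen fraction of the variance. The 95% threshold is an arbitrary assumption, not a recommendation from the original post:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random stand-in for the training feature matrix (assumption: 324 HOG features).
rng = np.random.default_rng(0)
X_train = rng.random((300, 324))

# Scale each feature to zero mean and unit variance before PCA.
scaler = StandardScaler().fit(X_train)
X_scaled = scaler.transform(X_train)

# A float in (0, 1) tells PCA to keep as many components as are needed
# to explain at least that fraction of the variance.
pca = PCA(n_components=0.95, svd_solver="full").fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

print("components kept:", pca.n_components_)
print("variance explained:", cumulative[-1])
```

The same fitted scaler and pca objects would then be applied to the test features with transform, not refitted.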

Just for interest, check the balance of classes.
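
A quick way to check the class balance is to count the occurrences of each label. The labels below are made up for illustration; in the question they would come from trnLabels and tstLabels:

```python
import numpy as np

# Made-up labels stored as a column vector, shape (N, 1).
labels = np.array([[0], [1], [1], [2], [2], [2]])

# ravel() flattens the column vector so each label is counted once.
classes, counts = np.unique(labels.ravel(), return_counts=True)
for c, n in zip(classes, counts):
    print(f"class {c}: {n} samples ({n / labels.size:.1%})")
```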

Regarding 'shall we use PCA before SVM or not': it depends heavily on the data. SVC can be pretty slow, so PCA (or another dimensionality reduction technique) may speed it up a little, but try both cases and then decide.
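
A rough sketch of how the two cases could be compared is given below. It uses a synthetic dataset from make_classification rather than the CIFAR-10 features, and the choice of 10 components and 5 folds is an assumption for illustration only:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 3-class problem standing in for the real feature matrix.
X, y = make_classification(n_samples=300, n_features=40, n_informative=10,
                           n_classes=3, random_state=0)

# Putting the scaler and PCA inside the pipeline means they are refitted
# on the training part of every fold, so nothing leaks from the held-out part.
svm_only = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1))
pca_svm = make_pipeline(StandardScaler(), PCA(n_components=10),
                        SVC(kernel="linear", C=1))

scores = {}
for name, model in [("SVM only", svm_only), ("PCA + SVM", pca_svm)]:
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean CV accuracy = {scores[name]:.3f}")
```

Which variant wins depends on the data, so the numbers from this toy problem say nothing about the image features in the question.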

The immediate concern in this sort of situation is that the model is over-fitted. Any professional reviewer would immediately return this to the investigator. In this case, I suspect it is a result of the statistical approach used.

I don't work with images, but I would question why PCA is being stacked in front of an SVM. In plain terms, you are using two successive methods that reduce/collapse a high-dimensional space, which would very likely lead to an overly definite outcome. If you have collapsed the dimensionality once, why repeat it?

PCA is standard for images, but it should be followed by something very simple, such as K-means.

The other approach instead of PCA is, of course, NMF, and I would recommend it if you feel PCA is not providing the resolution sought.

Otherwise the calculation looks fine.


accuracy = np.sum(np.equal(test_labels, y_pred)) / test_labels.shape[0] 

On second thought, the accuracy figure might not be affected by over-fitting IF (emphasis intended) test_labels contains a prediction for each image (of which ~50% are incorrect).

I'm only guessing that this is what the "test_labels" data is, however, and we have no idea how that prediction was derived, so I'm not sure there's enough information to answer the question. BTW, could someone explain "shape[0]", please? Is it needed?
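
On the shape[0] question: it is the length of the first axis, i.e. the number of test samples, so dividing by it turns a count of correct predictions into a fraction. The sketch below also illustrates a pitfall that may apply here, under the assumption (hinted at by the .ravel() call on the training labels, but not confirmed in the post) that the labels are stored with shape (N, 1). In that case np.equal broadcasts against the (N,) predictions to an (N, N) matrix, and with 10 equally represented classes the original formula yields N/10 = 100 whatever the predictions are. The balanced labels and random predictions are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Assumption: perfectly balanced labels stored as a column vector, shape (n, 1).
test_labels = np.repeat(np.arange(10), n // 10).reshape(-1, 1)
# Random guesses standing in for the classifier output, shape (n,).
y_pred = rng.integers(0, 10, size=n)

# shape[0] is the number of rows, i.e. the number of test samples.
print(test_labels.shape[0])

# (n, 1) compared with (n,) broadcasts to an (n, n) matrix of comparisons.
broadcast = np.equal(test_labels, y_pred)
print(broadcast.shape)
wrong = np.sum(broadcast) / test_labels.shape[0]

# Flattening the labels first compares element by element, as intended.
right = np.mean(test_labels.ravel() == y_pred)

print(f"broadcast formula: {wrong:.2f}")
print(f"elementwise accuracy: {right:.2%}")
```

Note that the original print statement formats a fraction with a percent sign, so even a correctly computed accuracy would need multiplying by 100 (or the `:.2%` format) to read as a percentage.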

One obvious problem with your approach is that you apply PCA in a rather peculiar way. You should typically only estimate one transform -- on the training data -- and then use it to transform any evaluation set as well.

This way, you kind of... implement SVM with whitening batch-norm, which sounds cool, but is at least rather unusual. So it would need much care. E.g. this way, you cannot classify a single sample. Still, it may work as an unsupervised adaptation technique.

Apart from that, it's hard to tell without access to your data. Are you sure that the test and train sets are disjoint?
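
One simple way to look for overlap between the sets is to hash every image and intersect the hashes. This sketch assumes the (height, width, rgb, sample) layout described in the question, uses a hypothetical helper named image_hashes, and plants one duplicate in random data so there is something to find. It only detects byte-identical copies, not near-duplicates:

```python
import hashlib
import numpy as np

def image_hashes(images):
    """Hash each image in a (height, width, rgb, sample) array by its raw bytes."""
    return {
        hashlib.sha1(np.ascontiguousarray(images[..., i]).tobytes()).hexdigest()
        for i in range(images.shape[-1])
    }

# Small random stand-ins for the training and testing image arrays.
rng = np.random.default_rng(0)
trn = rng.integers(0, 256, size=(8, 8, 3, 20), dtype=np.uint8)
tst = rng.integers(0, 256, size=(8, 8, 3, 5), dtype=np.uint8)
tst[..., 0] = trn[..., 3]  # plant one exact duplicate for the demonstration

overlap = image_hashes(trn) & image_hashes(tst)
print("images present in both sets:", len(overlap))
```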

Comments
  • What if it is right? Why do you think it is wrong?
  • I'm working on an assignment. The accuracy should be between 40-50%
  • No idea why it is wrong. I was wondering how you knew
  • I'd say that a linear SVM is a very simple classifier; could you detail why you are worried about it?
  • Simple: the results. No journal would accept results suspected of over-fitting. It's the key criticism of deep learning, because over-fitting can be the source of false positives. If the same results were obtained following e.g. K-means, then I would try NMF instead of PCA. So, you could be right, but it would need to be empirically demonstrated.
  • Sure, this result smells a lot, that's why OP has brought it here. But on the other hand, if you have a decent test set, then how could you possibly over-fit, unless you cheat in one way or another? I fail to see how this should be model related.
  • Under a standard test statistic for a trained dataset no peer-reviewer would allow publication, and 'over-fitting' would be the first panic button. However, the questioner hasn't performed this calculation. The accuracy statistic would appear to guard against over-fitting.