Cannot understand sklearn's PolynomialFeatures

I need help with sklearn's PolynomialFeatures. It works quite well with one feature, but whenever I add multiple features, it also outputs some values in the array besides the values raised to the powers of the degree. For example, for this array:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[230.1, 37.8, 69.2]])
poly = PolynomialFeatures(degree=2)

when I try to

X_poly = poly.fit_transform(X)

It outputs

[[ 1.00000000e+00 2.30100000e+02 3.78000000e+01 6.92000000e+01
5.29460100e+04 8.69778000e+03 1.59229200e+04 1.42884000e+03
2.61576000e+03 4.78864000e+03]]

Here, what are 8.69778000e+03, 1.59229200e+04, and 2.61576000e+03?

If you have features [a, b, c], the default polynomial features (in sklearn the default degree is 2) are [1, a, b, c, a^2, ab, ac, b^2, bc, c^2], in the order sklearn generates them.

8.69778000e+03 is 230.1 x 37.8 = 8697.78 (ab), 1.59229200e+04 is 230.1 x 69.2 = 15922.92 (ac), and 2.61576000e+03 is 37.8 x 69.2 = 2615.76 (bc, since 2615.76 = 2.61576000 x 10^3).
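You can also ask the fitted transformer which column is which. A minimal sketch (get_feature_names_out is available in scikit-learn 1.0+; older versions expose get_feature_names instead):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[230.1, 37.8, 69.2]])
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

# Map each output column to the term it represents
print(poly.get_feature_names_out(["a", "b", "c"]))
# ['1' 'a' 'b' 'c' 'a^2' 'a b' 'a c' 'b^2' 'b c' 'c^2']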

PolynomialFeatures is a simple way to create new features. There is a good reference here. Of course, there are also disadvantages of using PolynomialFeatures, such as overfitting (see here).

Edit: We have to be careful when using polynomial features. The formula for the number of polynomial features is N(n, d) = C(n + d, d), where n is the number of features, d is the degree of the polynomial, and C is the binomial coefficient (combination). In our case the number is C(3 + 2, 2) = 5!/(3!2!) = 10, but when the number of features or the degree is high, the polynomial features become too many. For example:

N(100,2)=5151
N(100,5)=96560646
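
You can check this count directly against the width of the transformed array; here is a quick sketch (math.comb requires Python 3.8+):

import math
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

def n_poly_features(n, d):
    # N(n, d) = C(n + d, d)
    return math.comb(n + d, d)

X = np.ones((1, 3))  # n = 3 features
width = PolynomialFeatures(degree=2).fit_transform(X).shape[1]
print(n_poly_features(3, 2), width)  # 10 10
print(n_poly_features(100, 2))      # 5151
print(n_poly_features(100, 5))      # 96560646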

So in this case you may need to apply regularization to penalize some of the weights. It is quite possible that the algorithm will start to suffer from the curse of dimensionality (here is also a very nice discussion).
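
For instance, you could chain the expansion with an L2-penalized linear model in a pipeline. This is only an illustrative sketch with made-up toy data, not a recommendation for any particular alpha:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge

# Toy data, purely illustrative: y depends on an interaction term plus noise
rng = np.random.RandomState(0)
X = rng.rand(50, 3)
y = X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=50)

# Ridge puts an L2 penalty on the weights of the expanded features;
# alpha controls the strength of the regularization
model = make_pipeline(PolynomialFeatures(degree=2), Ridge(alpha=1.0))
model.fit(X, y)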

PolynomialFeatures generates a new matrix with all polynomial combinations of the features up to the given degree.

For example, [a] will be converted into [1, a, a^2] for degree 2.

You can visualize the input being transformed into the matrix generated by PolynomialFeatures.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

a = np.array([1, 2, 3, 4, 5])
a = a[:, np.newaxis]  # reshape to a column vector of samples
poly = PolynomialFeatures(degree=2)
a_poly = poly.fit_transform(a)
print(a_poly)

Output:

[[ 1.  1.  1.]
 [ 1.  2.  4.]
 [ 1.  3.  9.]
 [ 1.  4. 16.]
 [ 1.  5. 25.]]

You can see the matrix is generated in the form [1, a, a^2].

To observe the polynomial features on a scatter plot, let's use the numbers 1-99.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures

# Making the numbers 1-99
a = np.arange(1, 100, 1)
a = a[:, np.newaxis]

# Scaling the data to mean 0 and standard deviation 1, so it can be observed easily
scaler = StandardScaler()
a = scaler.fit_transform(a)

# Applying PolynomialFeatures
poly = PolynomialFeatures(degree=2)
a_poly = poly.fit_transform(a)

# Flattening the polynomial feature matrix (creating a 1D array), so it can be plotted
a_poly = a_poly.flatten()
# Creating an array of the same size as a_poly with a number series (for plotting)
xarr = np.arange(1, a_poly.size + 1, 1)

# Plotting
plt.scatter(xarr, a_poly)
plt.title("Degree 2 Polynomial")
plt.show()

Output: [scatter plot of the degree-2 polynomial features]

Changing degree=3, we get: [scatter plot of the degree-3 polynomial features]

You have 3-dimensional data, and the following code generates all polynomial features of degree 2:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[230.1, 37.8, 69.2]])
poly = PolynomialFeatures()  # degree=2 by default
X_poly = poly.fit_transform(X)
X_poly
# array([[  1.00000000e+00,   2.30100000e+02,   3.78000000e+01,
#          6.92000000e+01,   5.29460100e+04,   8.69778000e+03,
#          1.59229200e+04,   1.42884000e+03,   2.61576000e+03,
#          4.78864000e+03]])

This can also be generated with the following code:

a, b, c = 230.1, 37.8, 69.2  # 3-dimensional data
np.array([[1, a, b, c, a**2, a*b, c*a, b**2, b*c, c**2]])  # all possible degree-2 polynomial features
# array([[  1.00000000e+00,   2.30100000e+02,   3.78000000e+01,
#          6.92000000e+01,   5.29460100e+04,   8.69778000e+03,
#          1.59229200e+04,   1.42884000e+03,   2.61576000e+03,
#          4.78864000e+03]])
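
If you want to confirm that the manual construction really matches sklearn's output, a quick self-contained comparison (this just re-checks the column ordering assumed above):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[230.1, 37.8, 69.2]])
a, b, c = X[0]
manual = np.array([[1, a, b, c, a**2, a*b, c*a, b**2, b*c, c**2]])

# c*a and a*c are the same value, so the column orders agree
print(np.allclose(manual, PolynomialFeatures(degree=2).fit_transform(X)))  # True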

Comments
  • Why does it give ab, bc, and ca?
  • @TechieBoy101: It's polynomial features, not monomial features. There's nothing restricting it to only one variable at a time.
  • @TechieBoy101, the default PolynomialFeatures in sklearn includes all polynomial combinations. You can set interaction_only=True to exclude the pure powers like a^2, b^2, c^2 (see the sketch below). Of course, you can drop the interactions instead if your model performs better that way - PolynomialFeatures is just a simple way to derive new features (in a somewhat artificial manner).
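
For reference, here is what interaction_only=True produces for the array from the question; only the bias column, the original features, and the cross products remain:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[230.1, 37.8, 69.2]])

# interaction_only=True keeps 1, a, b, c and the products ab, ac, bc,
# but drops the pure powers a^2, b^2, c^2
poly = PolynomialFeatures(degree=2, interaction_only=True)
print(poly.fit_transform(X))
# [[1.00000000e+00 2.30100000e+02 3.78000000e+01 6.92000000e+01
#   8.69778000e+03 1.59229200e+04 2.61576000e+03]]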