Using sklearn StandardScaler on only select columns

I have a numpy array X that has 3 columns and looks like the following:

array([[    3791,     2629,        0],
       [ 1198760,   113989,        0],
       [ 4120665,        0,        1],
       ...

The first 2 columns are continuous values and the last column is binary (0/1). I would like to apply the StandardScaler class only to the first 2 columns. I am currently doing this in the following way:

import numpy as np
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_subset = scaler.fit_transform(X[:, [0, 1]])  # scale the two continuous columns
X_last_column = X[:, 2]
X_std = np.concatenate((X_subset, X_last_column[:, np.newaxis]), axis=1)

The output of X_std is then:

array([[-0.34141308, -0.18316715,  0.        ],
       [-0.22171671, -0.17606473,  0.        ],
       [ 0.07096154, -0.18333483,  1.        ],
       ...,

Is there a way to perform this all in one step? I would like to include this as part of a pipeline where it will scale the first 2 columns and leave the last binary column as is.

I can't think of a way to make your code more compact, but you can definitely use your transformation in a Pipeline. You would have to define a class extending StandardScaler that performs the transformation only on the columns passed as arguments, keeping the others intact. See the code in this example; you would have to program something similar to ItemSelector.

Thanks for the responses. I ended up using a class to select columns like this:

from sklearn.base import BaseEstimator, TransformerMixin


class ItemSelector(BaseEstimator, TransformerMixin):
    """Select a subset of columns from a 2D array."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, x, y=None):
        # Stateless: nothing to learn from the data
        return self

    def transform(self, data_array):
        # Return only the requested columns
        return data_array[:, self.columns]

I then used FeatureUnion in my pipeline as follows to fit StandardScaler only to continuous variables:

FeatureUnion(
    transformer_list=[
        ('continuous', Pipeline([  # Scale the first 2 numeric columns
            ('selector', ItemSelector(columns=[0, 1])),
            ('scaler', StandardScaler())
        ])),
        ('categorical', Pipeline([  # Leave the last binary column as is
            ('selector', ItemSelector(columns=[2]))
        ]))
    ]
)

This worked well for me.
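
For anyone wanting to wire this into a complete pipeline, here is a minimal sketch; the LogisticRegression estimator and the target y are illustrative, not part of the original question:

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

# Assumes the ItemSelector class defined above is in scope
features = FeatureUnion(transformer_list=[
    ('continuous', Pipeline([
        ('selector', ItemSelector(columns=[0, 1])),
        ('scaler', StandardScaler())
    ])),
    ('categorical', Pipeline([
        ('selector', ItemSelector(columns=[2]))
    ]))
])

model = Pipeline([
    ('features', features),
    ('clf', LogisticRegression())  # illustrative downstream estimator
])
model.fit(X, y)  # X as in the question; y is a hypothetical target vector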

Since scikit-learn version 0.20 you can use the class sklearn.compose.ColumnTransformer for exactly this purpose.
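
Applied to the array in the question, a minimal sketch might look like this; remainder='passthrough' keeps the untransformed binary column in the output instead of dropping it:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

ct = ColumnTransformer(
    [('scale', StandardScaler(), [0, 1])],  # scale only the first two columns
    remainder='passthrough'  # append the remaining (binary) column unchanged
)
X_std = ct.fit_transform(X)

Note that passthrough columns are appended after the transformed ones, which here happens to preserve the original column order.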

Inspired by skd's recommendation to extend StandardScaler, I came up with the code below. It's not super efficient or robust (e.g., you'd need to update the inverse_transform method), but hopefully it's a helpful starting point:

import numpy as np
from sklearn.preprocessing import StandardScaler


class StandardScalerSelect(StandardScaler):
    """StandardScaler that only scales the DataFrame columns listed in `cols`."""

    def __init__(self, copy=True, with_mean=True, with_std=True, cols=None):
        self.cols = cols
        super().__init__(copy=copy, with_mean=with_mean, with_std=with_std)

    def transform(self, X):
        # Boolean mask of the columns that should NOT be scaled
        not_transformed_ix = np.isin(np.array(X.columns), np.array(self.cols), invert=True)

        # Still transforms all columns, just for convenience. For larger datasets
        # you would want to modify self.mean_ and self.scale_ so the dimensions match,
        # and then transform only the subset
        trans = super().transform(X)

        # Restore the original values of the unscaled columns
        trans[:, not_transformed_ix] = np.array(X.iloc[:, not_transformed_ix])

        return trans
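
As a quick illustration of how this might be used (the DataFrame below wraps the question's sample values in hypothetical column names):

import pandas as pd

df = pd.DataFrame({
    'f0': [3791, 1198760, 4120665],
    'f1': [2629, 113989, 0],
    'bin': [0, 0, 1],
})

scaler = StandardScalerSelect(cols=['f0', 'f1'])  # scale only f0 and f1
out = scaler.fit_transform(df)  # the 'bin' column comes back unchanged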

Comments
  • Thank you. The ItemSelector class you linked and FeatureUnion are what I would need to put this into a Pipeline.