1. Introduction

Scikit-learn (sklearn) is a popular Python machine learning library for data science tasks. It includes implementations of algorithms for tasks such as classification, clustering, and regression. It also provides tools for data pre-processing, feature engineering, and cross-validation.

In this tutorial, we’ll explore the differences between two functions from the sklearn library: transform() and fit_transform().

2. What Is transform()?

The transform() function is used after fit() is called. While fit() learns parameters from the training dataset, transform() modifies the data using the parameters learned by fit(). This can change the data's structure or values, for example by rescaling values or removing columns. The output is the modified data.

Typically, transform() can be applied to the same data that fit() was called on. However, we can also apply transform() to new data. For instance, suppose we have two datasets. We can apply fit() to the first dataset to learn parameters and then use the same parameters to transform the second dataset using transform().

2.1. An Example

Let’s look at a simple example of transform() in sklearn. We’ll use MinMaxScaler to scale numeric values to the range [0, 1]:

from sklearn.preprocessing import MinMaxScaler
import numpy as np 
scaler = MinMaxScaler() 

Suppose we want to scale two subsets of data, X_train and X_test, from the same dataset. First, we’ll call fit() on X_train:

X_train = np.array([[1, 2], [3, 7], [5, 10]])
X_test = np.array([[1, 8], [9, 6]])
scaler.fit(X_train)

When we call scaler.fit() on X_train, the scaler learns its parameters and returns the fitted scaler object itself. This object stores the minimum and maximum values it found in each column of X_train. When we apply transform() to X_train, it uses those minimum and maximum values to scale the data:

X_train_transform = scaler.transform(X_train)
print(X_train_transform)

# This will print [[0., 0. ], [0.5, 0.625], [1., 1. ]]
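To make the learned parameters concrete, here is a minimal sketch that inspects them and reproduces the scaling by hand. It assumes the default feature range of MinMaxScaler and uses the fitted attributes data_min_ and data_max_; the variable name manual is only illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1, 2], [3, 7], [5, 10]])

scaler = MinMaxScaler()
scaler.fit(X_train)

# Per-column minimum and maximum learned from X_train
print(scaler.data_min_)  # column minima: 1 and 2
print(scaler.data_max_)  # column maxima: 5 and 10

# With the default range, scaling is (x - min) / (max - min) per column
manual = (X_train - scaler.data_min_) / (scaler.data_max_ - scaler.data_min_)
print(np.allclose(manual, scaler.transform(X_train)))  # True
```

The hand-computed result matches what transform() returns, which shows that transform() only applies what fit() stored.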

We can also call transform() on X_test to scale it using the minimum and maximum computed during the fitting of the scaler.
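As a short sketch of that step, the snippet below fits the scaler on X_train and then transforms X_test with the same parameters. The variable name X_test_transform is an illustrative choice.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1, 2], [3, 7], [5, 10]])
X_test = np.array([[1, 8], [9, 6]])

scaler = MinMaxScaler()
scaler.fit(X_train)  # parameters come from X_train only

# X_test is scaled with the minimum and maximum of X_train
X_test_transform = scaler.transform(X_test)
print(X_test_transform)
# Values: [[0., 0.75], [2., 0.5]]
```

Note that the value 9 in the first column of X_test exceeds the training maximum of 5, so it is scaled to 2.0, outside the [0, 1] range. This is expected, because the scaler never saw X_test during fitting.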

3. What Is fit_transform()?

In contrast, fit_transform() combines fit() and transform() in one step. So, fit_transform() transforms the same data that it learns the parameters from. It is often considered more convenient, and for some transformers more efficient, because it performs both fit() and transform() in a single call.

3.1. An Example

If we call fit_transform() on X_train, it learns the minimum and maximum and transforms X_train:

scaler_2 = MinMaxScaler()
X_train_transformed = scaler_2.fit_transform(X_train)

The output is still the modified data with values between 0 and 1:

[[0., 0. ], [0.5, 0.625], [1., 1. ]]
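To check that the two approaches agree, here is a minimal sketch comparing fit_transform() with a separate fit() followed by transform() on the same data. The names one_step and two_steps are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1, 2], [3, 7], [5, 10]])

# One call: learn the parameters and transform the same data
one_step = MinMaxScaler().fit_transform(X_train)

# Two calls: fit() returns the fitted scaler, so the calls can be chained
two_steps = MinMaxScaler().fit(X_train).transform(X_train)

print(np.allclose(one_step, two_steps))  # True
```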

4. Using transform() and fit_transform() in Pipelines

These methods are also used within sklearn’s pipelines. A pipeline defines a sequence of steps that are applied to the data in order. For example, we can create a pipeline that applies MinMaxScaler to scale the data and then Binarizer to convert the scaled values into binary values (0 or 1) based on a given threshold.

When transform() is called on a fitted pipeline, MinMaxScaler scales the data, and the output is then passed as input to the Binarizer.
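A minimal sketch of such a pipeline follows. The step names "scale" and "binarize" and the threshold of 0.6 are assumptions chosen for illustration, not values from the text above.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Binarizer, MinMaxScaler

X_train = np.array([[1, 2], [3, 7], [5, 10]])
X_test = np.array([[1, 8], [9, 6]])

# Step names and threshold are illustrative choices
pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("binarize", Binarizer(threshold=0.6)),
])

# Fit every step on the training data
pipe.fit(X_train)

# transform() scales X_test, then passes the result to the Binarizer
X_test_binary = pipe.transform(X_test)
print(X_test_binary)
# Values: [[0., 1.], [1., 0.]]
```

Here X_test is first scaled to [[0, 0.75], [2, 0.5]] using the minimum and maximum of X_train, and then every value greater than 0.6 becomes 1 and the rest become 0.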

In contrast, when we call the pipeline’s fit_transform() method, it iterates over the steps sequentially. In each step, it applies its transformer’s fit_transform() or combines fit() and transform() if the step’s transformer doesn’t implement fit_transform().
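The sketch below compares the pipeline’s fit_transform() with the same steps applied by hand. As before, the step names and the threshold of 0.6 are illustrative assumptions.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Binarizer, MinMaxScaler

X_train = np.array([[1, 2], [3, 7], [5, 10]])

pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("binarize", Binarizer(threshold=0.6)),
])

# One call fits and applies every step in order
out_pipe = pipe.fit_transform(X_train)

# The same steps applied manually, one after the other
scaled = MinMaxScaler().fit_transform(X_train)
out_manual = Binarizer(threshold=0.6).fit_transform(scaled)

print(out_pipe)
# Values: [[0., 0.], [0., 1.], [1., 1.]]
print(np.array_equal(out_pipe, out_manual))  # True
```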

5. Conclusion

In this article, we’ve reviewed two common functions in the sklearn library: transform() and fit_transform(). The transform() method modifies data using learned parameters from fit(), whereas fit_transform() combines fit() and transform() in a single step.

Most sklearn transformers implement fit(), transform(), and fit_transform(), giving users flexibility in which method to use. fit_transform() is often used with training data to learn parameters and apply a transformation, whereas transform() is often used with test or new data after fit() has already been called.


Original title: What’s the Difference Between ‘transform’ and ‘fit_transform’ in sklearn?