1. Introduction
In scikit-learn, the Pipeline class and make_pipeline function are valuable tools for building machine learning workflows. In particular, they allow chaining multiple transformations and a final estimator, creating a streamlined process that handles data preprocessing and model training as a single unified operation.
This approach becomes particularly useful when performing cross-validation, hyperparameter tuning, or grid searches.
In this tutorial, we’ll explore the key differences between Pipeline and make_pipeline, emphasizing when and how to use each. We’ll also provide code examples that illustrate their respective usage.
2. Pipeline
The Pipeline class in scikit-learn allows chaining multiple steps together, where each step is either a transformer (such as StandardScaler) or an estimator (such as LogisticRegression). Additionally, each step in the pipeline has a name, and we must explicitly define these names when constructing the pipeline.
Furthermore, the main advantage of using a pipeline is that it applies the transformations and the final estimator in sequence automatically. In practice, we define the steps and scikit-learn handles the rest, which makes the code cleaner and less error-prone.
For example, let’s consider a scenario where we first standardize the data using StandardScaler and then apply LogisticRegression for classification.
Here, we use scikit-learn’s make_classification function to generate synthetic data and construct a pipeline:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
# Generate sample data
X, y = make_classification(n_samples=100, n_features=5, random_state=42)
# Define the pipeline with named steps
pipeline = Pipeline([
    ('scaler', StandardScaler()),      # Step 1: Standardize the data
    ('log_reg', LogisticRegression())  # Step 2: Apply Logistic Regression
])
# Fit the pipeline on the data
pipeline.fit(X, y)
# Make predictions
predictions = pipeline.predict(X)
# Output the first 10 predictions
print(predictions[:10])
The output looks like this:
[1 0 1 0 0 1 1 0 1 1]
In this code, we use Pipeline to scale the features and then apply logistic regression. Let’s walk through what the code does. First, we generate a synthetic dataset with 100 samples and 5 features using make_classification. Then, we define the pipeline: the first step is named scaler and the second is named log_reg. Next, we fit the pipeline by calling pipeline.fit(X, y), which scales the features and then fits the logistic regression model on the scaled data.
Finally, after the fitting, we use pipeline.predict(X) to make predictions on the same dataset.
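One advantage of explicit step names is that we can retrieve any fitted step later through the pipeline’s named_steps attribute. Continuing from the code above, here’s a short sketch that inspects what each step learned:
# Access the fitted steps by the names we assigned
scaler = pipeline.named_steps['scaler']
log_reg = pipeline.named_steps['log_reg']
# Inspect what each step learned during fitting
print(scaler.mean_)   # per-feature means learned by StandardScaler
print(log_reg.coef_)  # coefficients learned by LogisticRegression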
3. make_pipeline
The make_pipeline() function in scikit-learn is a simpler alternative to the Pipeline class. It’s a shorthand for creating a pipeline without explicitly naming the steps. Unlike the Pipeline class, where each step requires a name, make_pipeline automatically assigns names to the steps based on the classes of the transformers or estimators used.
Under the hood, make_pipeline() uses the Pipeline class. If we look at the source code of make_pipeline, we see that it simply constructs a Pipeline object and automatically generates names for the steps:
from sklearn.pipeline import Pipeline, _name_estimators

# Simplified from scikit-learn's source: _name_estimators derives
# a lowercase name for each transformer/estimator
def make_pipeline(*steps, memory=None, verbose=False):
    return Pipeline(_name_estimators(steps), memory=memory, verbose=verbose)
As shown in the code above, make_pipeline is simply a convenience function that creates a Pipeline for us. This means both approaches are fundamentally the same. make_pipeline just removes the need to manually assign step names, while still returning a Pipeline object.
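Since make_pipeline returns an ordinary Pipeline object, we can verify the equivalence with a quick check:
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler
# make_pipeline builds a regular Pipeline instance
pipe = make_pipeline(StandardScaler())
print(isinstance(pipe, Pipeline))  # True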
Additionally, the key benefit of make_pipeline is its simplicity and cleaner syntax, especially when we don’t need to access individual steps later. In particular, it’s useful for quickly chaining steps together when the names of the steps aren’t important.
Now, let’s revisit the previous example, but this time we use make_pipeline() to simplify the code:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
# Generate sample data
X, y = make_classification(n_samples=100, n_features=5, random_state=42)
# Define the pipeline using make_pipeline
pipeline = make_pipeline(
    StandardScaler(),     # Step 1: Standardize the data
    LogisticRegression()  # Step 2: Apply Logistic Regression
)
# Fit the pipeline on the data
pipeline.fit(X, y)
# Make predictions
predictions = pipeline.predict(X)
# Output the first 10 predictions
print(predictions[:10])
The output looks like this:
[1 0 1 0 0 1 1 0 1 1]
In this code, we use make_pipeline(), which tells scikit-learn to automatically assign step names based on the lowercase version of each class name. For instance, StandardScaler() becomes ‘standardscaler’ and LogisticRegression() becomes ‘logisticregression’.
Additionally, we notice that the pipeline definition is more concise than with the explicit naming required by the Pipeline class.
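We can confirm these auto-generated names by inspecting the pipeline’s named_steps attribute, continuing from the code above:
# The step names are the lowercase class names
print(pipeline.named_steps.keys())
The output looks like this:
dict_keys(['standardscaler', 'logisticregression'])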
4. Differences
The table below summarizes the key differences between Pipeline and make_pipeline:
| Feature | Pipeline | make_pipeline |
|---|---|---|
| Step Naming | The user explicitly names each step | Step names are generated automatically by scikit-learn |
| Syntax | Requires a manual definition of both the name and the step | Simpler syntax, with no explicit naming of steps |
| Use Case | Ideal when we need to access or modify individual steps later | Best for quickly chaining steps when step names aren’t important |
| Flexibility | More flexible | Less flexible in terms of naming steps, as it auto-generates names |
| Customization | We can fully customize the name of each step for better clarity | Names are based on the lowercase class names of the transformers/estimators |
Thus, while the naming convention is the most apparent difference, the flexibility and level of control offered by Pipeline make it a better choice for more complex workflows, whereas make_pipeline shines in its simplicity for straightforward tasks.
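To make this difference concrete, consider hyperparameter tuning, where pipeline parameters are addressed as <step name>__<parameter name>. Here’s a minimal sketch (the variable names and parameter grid are illustrative) showing how the same grid search is spelled differently depending on how the pipeline was built:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
# Illustrative data
X, y = make_classification(n_samples=100, n_features=5, random_state=42)
# With Pipeline, parameter keys use the names we assigned ('log_reg')
named_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('log_reg', LogisticRegression())
])
GridSearchCV(named_pipe, {'log_reg__C': [0.1, 1, 10]}).fit(X, y)
# With make_pipeline, parameter keys use the auto-generated names
auto_pipe = make_pipeline(StandardScaler(), LogisticRegression())
GridSearchCV(auto_pipe, {'logisticregression__C': [0.1, 1, 10]}).fit(X, y)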
5. Conclusion
In this article, we’ve explored Pipeline and make_pipeline, two valuable tools for streamlining machine learning workflows in scikit-learn. While Pipeline offers more flexibility through explicit naming, make_pipeline provides a concise and quick way to chain transformations and estimators together.
The choice between the two depends on the complexity of the use case and how much control we need over the pipeline structure.