1. Introduction
We often use the term orthogonalization when discussing Machine Learning (ML) topics. However, this concept came from linear algebra long before any ML application.
In this tutorial, we’ll discuss what it means from a theoretical perspective and why it’s so important in ML.
2. What Is Orthogonalization?
We start by defining orthogonal vectors. Then, we carry this abstraction over to the training of a neural network.
2.1. Orthogonalization in Linear Algebra
Let’s suppose we have vectors $\mathbf{v}_1, \mathbf{v}_2, \dots, \mathbf{v}_n$. The orthogonalization process will give us vectors $\mathbf{u}_1, \mathbf{u}_2, \dots, \mathbf{u}_n$ such that:

(1) $\quad \mathbf{u}_i \cdot \mathbf{u}_j = 0 \quad \text{for all } i \neq j$
Each vector of the set should be orthogonal to the others for the set to be considered an orthogonal set. This means that they are 90° from each other; therefore, their dot product is zero.
Additionally, if the vectors are unit-length as well as orthogonal, we call them orthonormal.
We can put all of this together mathematically:

(2) $\quad \mathbf{u}_i \cdot \mathbf{u}_j = \begin{cases} 0 & \text{if } i \neq j \\ 1 & \text{if } i = j \end{cases}$
If that is the case, we can also say that $\{\mathbf{u}_1, \dots, \mathbf{u}_n\}$ forms an orthonormal basis of the span of the original vectors $\mathbf{v}_1, \dots, \mathbf{v}_n$.
For example, the vectors $\mathbf{u}_1 = (1, 0)$ and $\mathbf{u}_2 = (0, 1)$ are orthonormal: they are perpendicular to each other, and each has unit length.
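A minimal sketch of the classical Gram-Schmidt process in NumPy (the function name gram_schmidt and the sample vectors are our own illustration, not part of any particular library) shows how linearly independent vectors are turned into an orthonormal set satisfying condition (2):

```python
import numpy as np

def gram_schmidt(vectors):
    """Orthonormalize linearly independent vectors via classical Gram-Schmidt."""
    basis = []
    for v in vectors:
        # Remove the components of v that lie along the vectors already in the basis
        w = v - sum(np.dot(v, u) * u for u in basis)
        basis.append(w / np.linalg.norm(w))  # rescale to unit length
    return np.array(basis)

v = [np.array([3.0, 1.0]), np.array([2.0, 2.0])]
u = gram_schmidt(v)

# Condition (2): the matrix of pairwise dot products of an orthonormal set is the identity
print(np.allclose(u @ u.T, np.eye(len(u))))  # True
```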
But how can a concept that looks so simple be so relevant in ML?
2.2. Orthogonal Features
The main abstraction we should carry over from orthogonalization in linear algebra is independence.
This means that two orthogonal vectors control independent directions. In our example, $\mathbf{u}_1$ lies along the horizontal direction (the x-axis), while $\mathbf{u}_2$ lies along the vertical direction (the y-axis).
When we apply Principal Component Analysis (PCA), for example, we obtain orthogonal principal components (PCs). This guarantees that they are uncorrelated, so we retain as much variance as possible from the original dataset while reducing its dimensionality. This is only possible because the components capture variability along different directions without redundancy. Having considered a specific application of orthogonalization in feature engineering, let's now take a broader view of ML.
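As a quick illustration (the synthetic dataset and the check below are our own sketch; PCA and its components_ attribute come from scikit-learn), we can verify that the principal directions returned by PCA are orthonormal:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))        # synthetic dataset: 200 samples, 5 features
X[:, 3] = X[:, 0] + 0.1 * X[:, 1]    # make two features correlated on purpose

pca = PCA(n_components=3).fit(X)

# Each row of components_ is a principal direction; orthonormality means their
# pairwise dot products form the identity, so the projected features are uncorrelated.
print(np.allclose(pca.components_ @ pca.components_.T, np.eye(3)))  # True
```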
3. Orthogonalization in Machine Learning
Let's imagine a full machine-learning workflow. We have a dataset that we split into training, validation, and test sets. We design an architecture and set the hyperparameters. Our final goal is a model that performs sufficiently well on all three subsets as well as on real-world data.
But what if a change we make to improve the performance on the training set negatively affects the performance on the validation and test sets?
For this reason, we should think of the performance on each subset as orthogonal to the others: depending on where the problem lies, we choose a different approach to fix it.
3.1. Improving the Performance
First, let's consider that the performance on the training set is not good enough. We can increase the complexity of the neural network by adding layers, or change the optimization algorithm.
However, if our model does not perform well on the validation set, we should consider adding some regularization. Alternatively, we can use a larger training set if possible.
But what if the model doesn't perform well on the test set? If we're doing well on the validation set but not on the test set, we've probably overfitted the validation set, so we'll likely need a bigger one.
Finally, if we don't get good results on real-world data, what can we do? We should take a second look at the problem formulation. It might even be necessary to change the cost function or define a different validation set.
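To make this concrete, here is a small hypothetical helper (the thresholds, names, and messages are purely illustrative, not a standard recipe) that treats each error gap as its own knob:

```python
def suggest_next_step(train_err, val_err, test_err, target_err=0.05):
    """Illustrative mapping from error gaps to the adjustment worth trying next."""
    if train_err > target_err:
        return "Underfitting the training set: add layers or change the optimizer."
    if val_err - train_err > target_err:
        return "Training/validation gap: add regularization or gather more training data."
    if test_err - val_err > target_err:
        return "Validation/test gap: use a bigger validation set (it's likely overfitted)."
    return "All subsets look fine: if real-world results still lag, revisit the problem formulation or cost function."

print(suggest_next_step(train_err=0.02, val_err=0.10, test_err=0.11))
```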
As we can see, there are different things we can try depending on the problem. Going back to the concept of orthogonality, we must think of each of these solutions as independent from the others, as if each one tuned a separate dimension of our model's performance.
4. Conclusion
In this article, we presented how the concept of orthogonalization can be used in Machine Learning. It ensures that we tune one aspect of our workflow at a time, aiming for the desired performance on the problem at hand.
We should keep in mind that these adjustments must be made carefully and one at a time. By following the guidelines provided here, we can tackle each situation independently, so we don't harm the performance on one subset to improve it on another.