How to easily improve a model’s performance with standardization

Wojciech Prazuch
4 min read · Aug 12, 2020

In this article, we will work with the well-known Titanic dataset available at https://www.kaggle.com/c/titanic. The data consists of three files; we will use only train.csv in our experiments. The goal is to check the performance of an SVM model before and after standardization.

For this experiment, we will use the Pandas, NumPy, and Scikit-Learn libraries.

Let us import the necessary libraries and load the dataset:
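A minimal sketch of this step (the variable names are this sketch’s own, and we assume train.csv has been downloaded into the working directory):

```python
import pandas as pd
import numpy as np

# Load the training portion of the Titanic data
df = pd.read_csv('train.csv')
df.head()
```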

The table consists of many different features: we see the names of the unfortunate passengers, their genders, and ticket numbers. For the sake of simplicity, we will only take a subset of a couple of features, mostly numerical ones:
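One possible selection (this exact column list is an assumption of the sketch, not quoted from the original code):

```python
# Keep a handful of mostly numerical columns plus the target
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']
df = df[features + ['Survived']]
```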

Apart from the Sex feature, all the other features have numerical values (even if they are categorical). To make use of that feature, we need to encode it as a numerical value. Thanks to the LabelEncoder class provided by sklearn, we can accomplish that with a few lines of code:
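A sketch of that encoding (LabelEncoder assigns integer labels in alphabetical order, so ‘female’ becomes 0 and ‘male’ becomes 1):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['Sex'] = le.fit_transform(df['Sex'])  # 'female' -> 0, 'male' -> 1
```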

Everything is now a number! Good, but to complete the job, we will additionally check whether there are any missing values in our data. To do that, we use a convenient pandas built-in function:
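For example:

```python
# Count the missing values in every column
df.isnull().sum()
```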

It turns out that we are missing quite a bit of Age information for our passengers. Let’s simply fill the missing values with the median age of the passengers. The reason for using the median rather than the mean of the Age feature is simple: the median tends to be more robust to outliers in the data.

There are more sophisticated methods for handling missing values, but for now, we will stick with this simple solution. We can fill the missing cells with the median by running the following piece of code:
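Something along these lines does the job:

```python
# Replace missing ages with the median age of all passengers
df['Age'] = df['Age'].fillna(df['Age'].median())
```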

Okay, let’s move on to the most important part, which is classification. We will first split our dataset into two parts: train and test. The training set will be used to train an SVM classifier with a radial basis function kernel, and the performance of the classifier will be checked on the test set.
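A sketch of the split (the 80/20 ratio and the random seed are assumptions, not values from the article):

```python
from sklearn.model_selection import train_test_split

X = df.drop('Survived', axis=1)  # features
y = df['Survived']               # target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```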

Now that we have our data split into two parts, we can train and evaluate the model.
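Something like the following (the exact score will depend on the split):

```python
from sklearn.svm import SVC

clf = SVC(kernel='rbf')  # the RBF kernel is also sklearn's default
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # mean accuracy on the test set
```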

The accuracy happens to be 0.68, which is not good. Let us quickly examine the ranges of the data:
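For example:

```python
# Minimum and maximum of every feature in the training set
X_train.describe().loc[['min', 'max']]
```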

We see that the ranges differ quite a lot. This can be troublesome for our classifier, as it uses a radial basis function as its kernel. The radial basis function uses Euclidean distances between observations to generate support vectors for classification. If our features happen to have very different ranges, the Euclidean distance calculation becomes dominated by the features with wider ranges of values. We can quickly fix this with the StandardScaler class, again provided by the sklearn library:
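A sketch of the scaling step:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on the training set only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics
```

Note that the scaler is fitted on the training set only and merely applied to the test set; this way, no information about the test data leaks into training.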

What StandardScaler does is calculate the mean and standard deviation of every feature. Then, for every column, it subtracts the mean of that column and divides by the standard deviation of that column. This way, our data is centered at 0 and has unit variance. We can double-check the means and standard deviations:
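A quick sanity check could look like this (ddof=0 matches the population standard deviation that sklearn uses internally):

```python
# Statistics computed by hand vs. those stored by the fitted scaler
print(np.allclose(X_train.mean(axis=0), scaler.mean_))          # True
print(np.allclose(X_train.std(axis=0, ddof=0), scaler.scale_))  # True

# The scaled training data should be centered with unit variance
print(X_train_scaled.mean(axis=0).round(6))  # ~0 for every column
print(X_train_scaled.std(axis=0).round(6))   # ~1 for every column
```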

We see that the means and standard deviations calculated by hand align with those generated by the StandardScaler class.

Finally, let us perform classification once again:
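The same classifier, this time on the scaled data:

```python
clf = SVC(kernel='rbf')
clf.fit(X_train_scaled, y_train)
print(clf.score(X_test_scaled, y_test))
```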

Our accuracy has jumped to 0.83 with a couple of lines of code! By using standardization, we have improved our classification greatly, without too much sweat!

The SVM classifier is not the only classifier that likes standardization! You can also try standardization with logistic regression; it is rarely a bad move. (Tree-based models such as random forests are scale-invariant, so standardization will not change their results, though it will not hurt either.) The only drawback could be a loss of interpretability of our features. The ages of Titanic passengers were probably higher than 4, never mind the negative Age values after standardization…
