
@adhithia ・ Feb 07, 2022 ・ 6 min read ・ Originally posted on adhithia.medium.com

We talk and hear a lot about *Regression* everywhere we go. So let us skip that. In this article, we shall assume that we are masters of *Regression* and that we have already built a regression model on our training data.

Now, from the model, you come to notice that it has *Overfit*. *Overfitting* is a term used in data science to describe a statistical model that fits its training data too closely, so that the algorithm is unable to perform accurately on unseen data, negating the goal of the method. This *Overfitting* could be because the model is too complex. This complexity will have to be reduced to improve the model and remove *Overfitting*.

*To do this, we can either decrease the magnitude of some of the Regression Coefficients and/or drop some of the features that do not add significant value to the final prediction. The process we apply to achieve this is called* ***Regularization***.

So we have come to the end of this article. Thank you.

Lol, just kidding. Well, that’s how most articles on this topic go anyway. But trust me, this one is going to be different. Let me take you through the complete experience — starting with understanding why and where we need it, its statistical importance, and how we implement it in Python.

**Understanding Bias and Variance**

Let us get to the basics once more — *Bias* and *Variance*.

We say that a model’s *Bias* is high when the model performs poorly on the training dataset. The *Variance* is said to be high when the model performs poorly on the test dataset. Let me give you an example to make this clearer.

**Scenario 1:**

Assume you are preparing for an exam (yes, please assume) and you read the answers to only 5 questions at the end of the chapter. With this, you will be able to answer very well if the exam has the same 5 questions you prepared for, but you will fail to answer any other questions asked in the exam. This is an example of **Low Bias** and **High Variance**: perfect on the material you trained on, poor on anything new.

**Scenario 2:**

But in another scenario, let us say that you have prepared for the exam by actually going through the entire chapter and understanding the concepts instead of memorizing the answers to 5 questions. With this, you will be able to answer any question that comes in the exam, but you will not be able to answer it exactly as given in the book. This is an example of **High Bias** and **Low Variance**: you perform consistently on any question, even if not perfectly on the ones from the book.

So now, coming back to *Data Science* terminology — it is important to understand that there is always going to be a trade-off between *Bias* and *Variance* in any model that you build. Take a look at this graph that shows *Variance* and *Bias* for different **Model Complexities**. A model with high complexity tends to have low *Bias* but high *Variance*, while a simpler model has high *Bias* but low *Variance*; the *Total Error* is lowest somewhere in between.

So it is now safe to say that we would be needing a model to be built that has the *Lowest Total Error* — A model that is capable of identifying all the patterns from the *Train Data* and also performing well on data it has not seen before i.e. *Test Data*.
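To make this trade-off concrete, here is a minimal sketch (the noisy sine data, polynomial degrees, and variable names are all illustrative assumptions, not from any case study): as the polynomial degree grows, the training error keeps falling, while the test error stops improving.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Illustrative synthetic data: a noisy sine curve.
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=100)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

results = {}
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    results[degree] = (mean_squared_error(y_tr, model.predict(X_tr)),
                       mean_squared_error(y_te, model.predict(X_te)))

# Training error keeps falling as the degree grows (bias shrinks);
# test error stops improving and can rise again (variance grows).
for degree, (train_mse, test_mse) in results.items():
    print(degree, round(train_mse, 3), round(test_mse, 3))
```

The degree-1 model underfits (high bias), while the degree-15 model chases the noise in the training set (high variance).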

**How does Regularization help with Model Complexity?**

*Regularization* comes into the picture when we need to manage model complexity to balance the *Bias-Variance* trade-off. Regularization shrinks the magnitude of the model coefficients towards 0 (and can even remove a feature from the model entirely when its coefficient becomes 0), thus bringing down the model complexity. This will in turn reduce the overfitting of the model and bring down the Total Error — exactly what we wanted to achieve.

Consider *Ordinary Least Squares* (OLS Regression) — the *Residual Sum of Squares* (RSS), which serves as the *Cost Function*, is given by:

RSS = Σᵢ (yᵢ − ŷᵢ)² = Σᵢ (yᵢ − β₀ − Σⱼ βⱼ xᵢⱼ)²

where yᵢ is the actual value, β₀ and the βⱼ are the model coefficients, and xᵢⱼ is the value of feature j for observation i.

When we build a Regression Model, we choose the feature coefficients in such a way that the Cost Function, or the RSS, is minimized. Note that this RSS takes into account only the *Bias* that comes out of the model and not the *Variance*. So the model may reduce the *Bias* and overfit the training dataset, which results in the model having *High Variance*.
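As a quick illustration of the RSS formula (the numbers and coefficients here are made up, not from the case study), it can be computed directly:

```python
import numpy as np

# Toy data and hand-picked coefficients, purely for illustration.
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 4.2, 5.9, 8.1])
beta0, beta1 = 0.0, 2.0

# RSS = sum of squared residuals between actual and predicted values.
residuals = y - (beta0 + beta1 * X)
rss = np.sum(residuals ** 2)
print(rss)  # approximately 0.07 for these numbers
```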

Hence, *Regularization* helps us here by modifying the model's *Cost Function* slightly to reduce the model complexity and thereby reduce *Overfitting*.

**So how do we actually do it?**

In *Regularization*, we add a *Penalty* term to the *Cost Function* that will help us in controlling the Model Complexity.

After *Regularization*, the *Cost Function* becomes **RSS + Penalty**, i.e. we add a Penalty term to the regular RSS in the cost function.

**Ridge Regression**

In *Ridge Regression*, we add a penalty term which is *lambda* (λ) times the sum of squares of the weights (model coefficients):

Cost = RSS + λ Σⱼ βⱼ²

Note that the penalty term (referred to as the **Shrinkage Penalty**) contains the sum of squares of the weights. So in *Ridge Regression*, the coefficients are shrunk towards 0, but they never become exactly 0, so no feature is eliminated outright.

Let us now understand what **lambda** (λ) does. λ controls the strength of the penalty: when λ = 0, the penalty term vanishes and we are back to plain OLS; as λ grows, the coefficients are shrunk more and more towards 0, and a very large λ will underfit the model.

This is why we must choose the right value of λ to make sure the Model Complexity is effectively reduced but there is no *Overfitting* or *Underfitting*. Choosing the right λ value in Ridge Regression can be done by *Hyperparameter Tuning.*
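As a sketch of what that tuning can look like (scikit-learn calls λ `alpha`; the synthetic data and the alpha grid here are illustrative assumptions), `RidgeCV` cross-validates each candidate λ and keeps the one with the best score:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative synthetic regression problem.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0,
                       random_state=0)

# Standardize first, then cross-validate over the candidate lambdas.
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
model = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas, cv=5))
model.fit(X, y)

# The lambda that achieved the best cross-validation score.
best_alpha = model.named_steps['ridgecv'].alpha_
print(best_alpha)
```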

Please note that before performing *Ridge Regression*, the dataset must be *Standardized*. This is because the penalty acts on the magnitudes of the model coefficients, and comparing and penalizing them only makes sense if the features are on the same scale.

**Lasso Regression**

The important difference between Ridge and Lasso Regression lies in the Penalty Term.

In *Lasso Regression*, we add a penalty term which is *lambda* (λ) times the sum of the absolute values of the weights (model coefficients):

Cost = RSS + λ Σⱼ |βⱼ|

The remaining principles, and the way the Penalty Term brings down the model coefficients to decrease the Model Complexity, remain the same as in *Ridge Regression*.

But one difference that must be noted here is that in the case of *Lasso Regression*, the *Shrinkage Term (Penalty Term)* forces some of the model coefficients to *become exactly 0*, thereby *removing the entire feature* from the model (given that the λ value is large enough). This gives a whole new application of Lasso Regression — **Feature Selection**. This is not possible in the case of Ridge Regression.
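A quick illustration of this difference (synthetic data; the alpha value of 5.0 is an arbitrary assumption): with several uninformative features, Lasso drives some coefficients to exactly 0, while Ridge only shrinks them:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# 10 features, of which only 3 actually drive the target.
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=5.0).fit(X, y)
ridge = Ridge(alpha=5.0).fit(X, y)

# Lasso zeroes out some coefficients (feature selection);
# Ridge shrinks all of them but keeps every feature.
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print(n_zero_lasso, n_zero_ridge)
```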

As with *Ridge Regression*, the dataset must be *Standardized* before performing *Lasso Regression*, for the same reason: the penalty only treats the coefficients fairly when the features are on the same scale.

**Python Implementation of Regularization Techniques — Ridge and Lasso**

The entire implementation of Ridge and Lasso regression along with a detailed analysis of a dataset starting with Exploratory Data Analysis, Multiple Linear Regression including Multicollinearity, VIF Analysis etc. can be found here.

The dataset used in this implementation is from the ‘**Surprise Housing Case Study**’.

You can find the ** Github** link to the Python implementation here: https://github.com/Adhithia/RegressionWithRegularization

You can edit and collaborate on the same project on Kaggle here:

https://www.kaggle.com/adhithia/regression-with-ridge-and-lasso-regularization

**Snippet — Lasso Regression Implementation**

```python
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# list of alphas to tune - if the value is too high it will lead to
# underfitting; if it is too low, it will not handle the overfitting
params = {'alpha': [0.0001, 0.001, 0.01, 0.05, 0.1,
                    0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 2.0, 3.0,
                    4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 20, 50, 100, 500, 1000]}
folds = 5
lasso = Lasso()

# cross validation over the alpha grid
model_cv = GridSearchCV(estimator=lasso,
                        param_grid=params,
                        scoring='neg_mean_absolute_error',
                        cv=folds,
                        return_train_score=True,
                        verbose=1)
model_cv.fit(X_train_lm, y_train)

# Printing the best hyperparameter alpha
print(model_cv.best_params_)

# Fitting the Lasso model with the chosen alpha and printing the
# coefficients that have been penalised
alpha = 0.001
lasso = Lasso(alpha=alpha)
lasso.fit(X_train_lm, y_train)
lasso.coef_
```
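The snippet above tunes a `Lasso` estimator; the Ridge version differs only in the estimator class. Here is a minimal sketch of it — since the case-study variables (`X_train_lm`, `y_train`) are not reproduced here, it runs on synthetic data instead, and the alpha grid is an illustrative assumption:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the case-study training data.
X_train, y_train = make_regression(n_samples=150, n_features=8,
                                   noise=10.0, random_state=42)

# Cross-validate a small grid of alphas, exactly as for Lasso above.
params = {'alpha': [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]}
model_cv = GridSearchCV(estimator=Ridge(),
                        param_grid=params,
                        scoring='neg_mean_absolute_error',
                        cv=5,
                        return_train_score=True)
model_cv.fit(X_train, y_train)

# Refit Ridge with the winning alpha and inspect the shrunk coefficients.
best_alpha = model_cv.best_params_['alpha']
ridge = Ridge(alpha=best_alpha).fit(X_train, y_train)
print(best_alpha, ridge.coef_)
```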

That’s the end of *Regression Regularization Techniques — Ridge and Lasso*.
