Table of Contents:
- About Regularization
- Types of Penalties
- Regularization Techniques
- Lasso Regression
- Example Of Lasso Regression
- Conclusion
About Regularization
To understand Lasso Regression, we first need to know what Regularization is.
Regularization is a technique used to avoid overfitting: a penalty is added to the loss function so that the model has lower variance on the test data. The model is then likely to perform better at prediction.
In simple terms, it shrinks the parameters and simplifies the model so that it overfits less, i.e. it has lower variance on the test data and therefore predicts better on it.
For example, suppose we train a model on a training set with only two points: a straight line fits them perfectly and the residuals are zero. But when we evaluate the model on the test set we see high variance, which means the model is overfitted and we need to fix that.
In other words, a model may reach 100 percent accuracy on the training data, yet its accuracy on the test data is substantially lower.
Types of Penalties
Regularization works by biasing the model coefficients toward (or exactly to) zero. In simple words, it shrinks the slope of the fitted line to find a better fit. There are two types of penalties, described below.
L1 regularization
It adds an L1 penalty proportional to the absolute value (modulus) of the coefficients. In this process some coefficients can become exactly zero and are eliminated from the model.
L2 regularization
It adds an L2 penalty proportional to the square of the coefficients. Here all coefficients are shrunk by a similar factor, and none are eliminated from the model.
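As a quick illustration (a minimal sketch with made-up numbers, not taken from the article's dataset), the two penalty terms can be computed directly from a coefficient vector:
import numpy as np
# Hypothetical coefficient vector, for illustration only
coef = np.array([0.0, 2.5, -1.2, 0.0, 0.7])
lam = 0.1  # regularization strength (lambda / alpha)
l1_penalty = lam * np.sum(np.abs(coef))   # L1: lambda * sum of |beta_j|  -> 0.44
l2_penalty = lam * np.sum(coef ** 2)      # L2: lambda * sum of beta_j^2  -> about 0.818
print(l1_penalty, l2_penalty)
Because the L1 term grows linearly even for small coefficients, the optimizer can push weak coefficients exactly to zero, which is how Lasso performs feature selection.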
Regularization Techniques
There are two main regularization techniques:
- Ridge Regression
- Lasso Regression
Each assigns the penalty to the coefficients in a different way. In this article, we will learn about the Lasso regularization technique.
Lasso Regression
“LASSO” stands for Least Absolute Shrinkage and Selection Operator. The model uses shrinkage: a penalty recalibrates the fit so that coefficients that are not substantial are shrunk toward, and possibly all the way to, zero.
It uses the L1 regularization penalty. This type of regression is well suited to models with high levels of multicollinearity, or when we want to automate parts of model selection, such as feature selection or parameter elimination.
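As a small illustration of that point (a sketch on synthetic data, not the vehicle dataset used below), Lasso tends to keep one of two nearly identical features and zero out the other:
import numpy as np
from sklearn.linear_model import Lasso
rng = np.random.RandomState(0)
x1 = rng.randn(200)
x2 = x1 + 0.01 * rng.randn(200)            # nearly collinear with x1
X_demo = np.column_stack([x1, x2])
y_demo = 3.0 * x1 + 0.1 * rng.randn(200)
lasso_demo = Lasso(alpha=0.1).fit(X_demo, y_demo)
print(lasso_demo.coef_)                     # one coefficient is driven to (or very near) zero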
We can represent the Lasso loss function mathematically as:
$$\text{Lasso loss} = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \lambda \sum_{j=1}^{p}\lvert\beta_j\rvert$$
where $\hat{y}_i$ is the prediction for the i-th observation, $\beta_j$ are the regression coefficients, and $\lambda$ controls the strength of the L1 penalty.
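To make the formula concrete, here is a minimal sketch (with made-up numbers, purely illustrative) that computes the Lasso loss for a given coefficient vector:
import numpy as np
X_toy = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5]])  # hypothetical features
y_toy = np.array([2.0, 2.5, 4.0])                        # hypothetical targets
beta = np.array([1.0, 0.3])                              # candidate coefficients
lam = 0.5                                                 # penalty strength
y_hat = X_toy @ beta                                      # predictions
lasso_loss = np.sum((y_toy - y_hat) ** 2) + lam * np.sum(np.abs(beta))
print(lasso_loss)                                         # about 1.235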
Example Of Lasso Regression
import datetime  # used later to convert the Year column into the age of the car
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score as ac
Load the data set by clicking on this link - Vehicle Price dataset
price_data = pd.read_csv('vehicle_price.csv')
price_data.head(5)
Unnamed: 0 Name ... New_Price Price
0 0 Maruti Wagon R LXI CNG ... NaN 1.75
1 1 Hyundai Creta 1.6 CRDi SX Option ... NaN 12.50
2 2 Honda Jazz V ... 8.61 Lakh 4.50
3 3 Maruti Ertiga VDI ... NaN 6.00
4 4 Audi A4 New 2.0 TDI Multitronic ... NaN 17.74
price_data.info()
RangeIndex: 6019 entries, 0 to 6018
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 6019 non-null int64
1 Name 6019 non-null object
2 Location 6019 non-null object
3 Year 6019 non-null int64
4 Kilometers_Driven 6019 non-null int64
5 Fuel_Type 6019 non-null object
6 Transmission 6019 non-null object
7 Owner_Type 6019 non-null object
8 Mileage 6017 non-null object
9 Engine 5983 non-null object
10 Power 5983 non-null object
11 Seats 5977 non-null float64
12 New_Price 824 non-null object
13 Price 6019 non-null float64
dtypes: float64(2), int64(3), object(9)
memory usage: 658.5+ KB
price_data.isnull().sum()
Unnamed: 0 0
Name 0
Location 0
Year 0
Kilometers_Driven 0
Fuel_Type 0
Transmission 0
Owner_Type 0
Mileage 2
Engine 36
Power 36
Seats 42
New_Price 5195
Price 0
dtype: int64
# Dropping the Unnamed: 0, New_Price and Location columns
price_data = price_data.drop(['Unnamed: 0', 'New_Price', 'Location'], axis=1)
price_data = price_data.dropna()
price_data = price_data.reset_index(drop=True)
price_data['Fuel_Type'].value_counts()
Diesel    3195
Petrol    2714
CNG         56
LPG         10
Name: Fuel_Type, dtype: int64
price_data['Transmission'].value_counts()
Manual       4266
Automatic    1709
Name: Transmission, dtype: int64
price_data['Owner_Type'].value_counts()
First             4903
Second             953
Third              111
Fourth & Above       8
Name: Owner_Type, dtype: int64
# Let's split some columns to create new features
train_df = price_data.copy()
# 'Name' holds the company and the model, e.g. "Maruti Wagon R LXI CNG"
name = train_df['Name'].str.split(" ", n=2, expand=True)
train_df['Company'] = name[0]
train_df['Model'] = name[1]
# Keep only the numeric part of 'Mileage', 'Engine' and 'Power' (the unit string is dropped)
train_df['Mileage'] = train_df['Mileage'].str.split(" ", n=1, expand=True).get(0)
train_df['Engine'] = train_df['Engine'].str.split(" ", n=1, expand=True).get(0)
train_df['Power'] = train_df['Power'].str.split(" ", n=1, expand=True).get(0)
train_df = train_df.drop(['Name'], axis=1)
train_df['Mileage'] = train_df['Mileage'].astype(float)
train_df['Engine'] = train_df['Engine'].astype(int)
# Some 'Power' entries contain the string "null"; treat them as missing and drop those rows
train_df.replace("null", np.nan, inplace=True)
train_df = train_df.dropna()
train_df = train_df.reset_index(drop=True)
train_df['Power'] = train_df['Power'].astype(float)
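As a side note, a slightly more defensive way to do this string-to-number conversion (a sketch meant to replace the astype and replace("null") steps above, applied to the raw string columns) is to let pandas coerce anything unparsable to NaN:
for col in ['Mileage', 'Engine', 'Power']:
    train_df[col] = pd.to_numeric(
        train_df[col].str.split(" ", n=1, expand=True)[0],  # keep the numeric part only
        errors='coerce'                                      # "null" and other junk become NaN
    )
train_df = train_df.dropna().reset_index(drop=True)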
train_df['Company'].value_counts()
Maruti 1175
Hyundai 1058
Honda 600
Toyota 394
Mercedes-Benz 316
Volkswagen 314
Ford 294
Mahindra 268
BMW 262
Audi 235
Tata 183
Skoda 172
Renault 145
Chevrolet 120
Nissan 89
Land 57
Jaguar 40
Mitsubishi 27
Mini 26
Fiat 23
Volvo 21
Porsche 16
Jeep 15
Datsun 13
Force 3
ISUZU 2
Bentley 1
Ambassador 1
Isuzu 1
Lamborghini 1
Name: Company, dtype: int64
train_df['Company'] = train_df['Company'].replace('ISUZU', 'Isuzu')
# Handling rare categorical values: any category covering less than 1% of the rows is relabelled 'Rare'
cat_features = [feature for feature in train_df.columns if train_df[feature].dtype == 'O']
for feature in cat_features:
    # share of rows belonging to each category of this feature
    temp = train_df.groupby(feature)['Price'].count() / len(train_df)
    temp_df = temp[temp > 0.01].index   # categories present in more than 1% of rows
    train_df[feature] = np.where(train_df[feature].isin(temp_df), train_df[feature], 'Rare')
train_df['Company'].value_counts()
Maruti 1175
Hyundai 1058
Honda 600
Toyota 394
Mercedes-Benz 316
Volkswagen 314
Ford 294
Mahindra 268
BMW 262
Rare 247
Audi 235
Tata 183
Skoda 172
Renault 145
Chevrolet 120
Nissan 89
Name: Company, dtype: int64
train_df.info()
RangeIndex: 5872 entries, 0 to 5871
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Year 5872 non-null int64
1 Kilometers_Driven 5872 non-null int64
2 Fuel_Type 5872 non-null object
3 Transmission 5872 non-null object
4 Owner_Type 5872 non-null object
5 Mileage 5872 non-null float64
6 Engine 5872 non-null int32
7 Power 5872 non-null float64
8 Seats 5872 non-null float64
9 Price 5872 non-null float64
10 Company 5872 non-null object
11 Model 5872 non-null object
dtypes: float64(4), int32(1), int64(2), object(5)
memory usage: 527.7+ KB
train_df['Seats'] = train_df['Seats'].astype(int)
# Encoding categorical data with one-hot encoding
columns = ['Fuel_Type', 'Transmission', 'Owner_Type', 'Company', 'Model']
def categorical_ohe(multicolumns):
    df = train_df.copy()
    i = 0
    for fields in multicolumns:
        print(fields)
        d1 = pd.get_dummies(train_df[fields])   # one-hot encode this column
        if i == 0:
            df = d1.copy()
        else:
            df = pd.concat([df, d1], axis=1)
        i = i + 1
    # Append the original frame; the original categorical columns are dropped later
    df = pd.concat([df, train_df], axis=1)
    return df
final_df = categorical_ohe(columns)
final_df = final_df.loc[:,~final_df.columns.duplicated()]
# Convert the manufacturing Year into the age of the car (in years)
now = datetime.datetime.now()
final_df['Year'] = final_df['Year'].apply(lambda x: now.year - x)
corr = final_df.corr()
corr
Diesel Petrol Rare ... Power Seats Price
Diesel 1.000000 -0.977947 -0.113891 ... 0.292420 0.309581 0.321035
Petrol -0.977947 1.000000 -0.096114 ... -0.272662 -0.303177 -0.309363
Rare -0.113891 -0.096114 1.000000 ... -0.096618 -0.033244 -0.058408
Automatic 0.139557 -0.125612 -0.067592 ... 0.644688 -0.074554 0.585623
Manual -0.139557 0.125612 0.067592 ... -0.644688 0.074554 -0.585623
... ... ... ... ... ... ...
Mileage 0.097562 -0.130056 0.153696 ... -0.538844 -0.331576 -0.341652
Engine 0.430151 -0.410837 -0.095742 ... 0.866301 0.401116 0.658047
Power 0.292420 -0.272662 -0.096618 ... 1.000000 0.101460 0.772843
Seats 0.309581 -0.303177 -0.033244 ... 0.101460 1.000000 0.055547
Price 0.321035 -0.309363 -0.058408 ... 0.772843 0.055547 1.000000
corr[corr['Price'] > 0.4]
Diesel Petrol Rare ... Power Seats Price
Automatic 0.139557 -0.125612 -0.067592 ... 0.644688 -0.074554 0.585623
Engine 0.430151 -0.410837 -0.095742 ... 0.866301 0.401116 0.658047
Power 0.292420 -0.272662 -0.096618 ... 1.000000 0.101460 0.772843
Price 0.321035 -0.309363 -0.058408 ... 0.772843 0.055547 1.000000
# Drop the original (non-encoded) categorical columns, keeping only their dummy variables
df = final_df.drop(columns, axis=1)
X = df.drop(['Price'],axis=1)
y = df['Price']
X.head(5)
Diesel Petrol Rare Automatic ... Mileage Engine Power Seats
0 0 0 1 0 ... 26.60 998 58.16 5
1 1 0 0 0 ... 19.67 1582 126.20 5
2 0 1 0 0 ... 18.20 1199 88.70 5
3 1 0 0 0 ... 20.77 1248 88.76 7
4 1 0 0 1 ... 15.20 1968 140.80 5
# Splitting Dataset into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2, random_state = 0)
# Feature Scaling
scaler = StandardScaler()
scaler.fit(X)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
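One caveat worth noting: fitting the scaler on the full feature matrix X lets information from the test rows leak into the preprocessing. A stricter variant of the same step (a small sketch, not the code used for the result below) fits the scaler on the training split only:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)   # learn mean/std from the training split only
X_test = scaler.transform(X_test)         # apply the same transformation to the test split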
best_alpha = 0.00099
regr = Lasso(alpha=best_alpha, max_iter=50000)
regr.fit(X_train,y_train)
y_pred = regr.predict(X_test)
ac(y_test,y_pred)
0.7761524710799422
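Since the conclusion below mentions hyperparameter tuning, here is a minimal sketch (not part of the original walkthrough) of how alpha could be chosen automatically with LassoCV, and how to check which features the L1 penalty has eliminated:
from sklearn.linear_model import LassoCV
# Cross-validated search over a range of alpha values (5-fold CV)
lasso_cv = LassoCV(cv=5, max_iter=50000, random_state=0)
lasso_cv.fit(X_train, y_train)
print("best alpha:", lasso_cv.alpha_)
print("test R2:", ac(y_test, lasso_cv.predict(X_test)))
# Coefficients shrunk exactly to zero correspond to features eliminated by the L1 penalty
eliminated = [col for col, coef in zip(X.columns, regr.coef_) if coef == 0]
print("features eliminated by the Lasso fit above:", eliminated)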
Conclusion
In this article we learnt about Regularization, the types of penalties it uses, and the main regularization techniques. We then covered Lasso Regression in brief and how to implement it. Our model reached an R² score of about 0.78 on the test data, which can be increased further with hyperparameter tuning; if you are not sure how to do hyperparameter tuning, please refer to the previous article on that topic.