The Life Expectancy Prediction Project Using Machine Learning!

Kartik Aggarwal
13 min read · Dec 19, 2022


In this article, I use a real-world data set gathered from the websites of the United Nations and the World Health Organization. The data contains features like country status, alcohol consumption, and expenditure percentages, which we will use to predict adult life expectancy.

Step 1: Understanding the Problem Statement

As stated at the beginning of this hands-on project, we will train a linear regression model to estimate life expectancy, using a real-world data set. It includes country status, infant mortality, and alcohol consumption, and covers a variety of economic indicators such as GDP and income composition of resources. The idea is that we will feed all of that data into a linear regression model and have it forecast life expectancy for us.

Click here to get the dataset.

Step 2: Importing the Necessary Libraries

I’ll be importing our key libraries: pandas, numpy, seaborn, and matplotlib.pyplot.
In basic terms, Pandas is used to manipulate data frames. Think of it as Excel inside Python.
NumPy is used to manipulate multidimensional arrays.
Matplotlib and Seaborn are used to visualise data. I particularly enjoy Seaborn since it’s quite powerful and generates extremely impressive visuals with very few lines of code.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

Next, I’m going to read my data set.

life_expectancy_df = pd.read_csv('Life_Expectancy_Data.csv')
life_expectancy_df

So, basically, I have my data set, and it appears that I have 2,938 samples and 21 columns. One of them is Status, which refers to whether a country is in the developed world or the developing world. We also have life expectancy here, which will be our target variable; that is what we are attempting to predict.
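Before going further, a quick sanity check (an illustrative sketch; it just re-reads the data frame we already loaded):

print(life_expectancy_df.shape)                     # (rows, columns)
print(life_expectancy_df['Status'].value_counts())  # developed vs. developing counts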

Step 3: Cleaning the Data and Performing Exploratory Data Analysis

So first I’m going to use Seaborn’s heatmap to show whether there are missing values in my data frame.

sns.heatmap(life_expectancy_df.isnull(), yticklabels = False, cbar = False, cmap="Reds")

When we go and train our linear regression model, we want to make sure that we do not have any missing values.

life_expectancy_df.info()

You should notice that the Year column, for example, has no missing data, which is why it shows 2938 non-null values. However, a column like Alcohol shows only 2744 non-null values, which means 194 values (the difference between 2938 and 2744) are missing. And if you scroll up to the heatmap, you can spot the missing data in the Alcohol column.
So it appears that we have a large amount of missing data, and we’ll learn how to deal with it shortly.

Next, I can obtain a statistical summary of my data frame.

life_expectancy_df.describe()

The summary is produced for each numerical column; you can scroll horizontally in the notebook to see them all.

Look at the Year column, for instance: the minimum is 2000 and the maximum is 2015, and you can also see the standard deviation and the mean, as well as the 25th, 50th, and 75th percentiles, for every numerical column.
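If you only care about a couple of these statistics, you can also query them directly. A small sketch, assuming the column names match the CSV exactly (note the trailing space in 'Life expectancy '):

print(life_expectancy_df['Year'].min(), life_expectancy_df['Year'].max())
print(life_expectancy_df['Life expectancy '].describe())

Following that, I can plot histograms of the whole data frame by calling .hist() on life_expectancy_df.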

life_expectancy_df.hist(bins = 30, figsize = (20, 20), color = 'g');

So, as you can see, I have a lot of features here. For life expectancy, for example, the highest frequency falls roughly between 70 and 80 years.
Consider another metric, such as GDP. So here I have the GDP or gross domestic product, and you should be able to see that many of the countries covered here have a low GDP.
There is also schooling: most values fall between 10 and 15 years, and the frequency gradually decreases as you get closer to 20.
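If the full grid of histograms is overwhelming, you can also zoom into one feature at a time; for example (illustrative):

life_expectancy_df['Schooling'].hist(bins = 30, color = 'g')
plt.xlabel('Schooling (years)')
plt.ylabel('Frequency')
plt.show()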
Next, I wanted to utilize Seaborn to create the pair plot, which is effectively the grid of scatter plots between all of the features in my data frame.

# Note: pairplot creates its own figure, so a separate plt.figure(figsize=...) call has no effect.
# Plotting all 21 features against each other also produces a grid that is hard to read.
sns.pairplot(life_expectancy_df)

As you can see, I get the scatter plots between all of the features: 21 features in the rows and the same in the columns. Even if we zoom in a little, it is difficult to read.
So, following that, I’m going to plot scatter plots for only the interesting features.
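One middle ground, as a sketch with a hand-picked subset of columns (mind the exact spellings, including the leading space in ' HIV/AIDS'):

cols = ['Life expectancy ', 'Schooling', 'GDP', ' HIV/AIDS']
sns.pairplot(life_expectancy_df[cols])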
I’ll say up front that I’m curious about the relationship between schooling and life expectancy.

sns.scatterplot(data = life_expectancy_df, x = 'Schooling', y = 'Life expectancy ')

So, essentially, as schooling (the number of years individuals go to school) increases, you can observe life expectancy improving as well.

Next, I study the relationship between GDP (gross domestic product) and life expectancy.

sns.scatterplot(data = life_expectancy_df, x = 'GDP', y = 'Life expectancy ')

It’s not a strictly linear relationship, but you should be able to see that life expectancy tends to rise as GDP grows.

Next, I study the relationship between the income composition of resources, which essentially measures how productively a given country’s resources are being utilized, and life expectancy.

sns.scatterplot(data = life_expectancy_df, x = 'Income composition of resources', y = 'Life expectancy ')

You should be able to observe a clear relationship between the two that could readily be fit with a straight line, suggesting that nations with efficient use of resources tend to have populations with long life expectancies, pushing beyond 70 or 80 years.

Next, let’s take a look at the relationship between HIV/AIDS and life expectancy; please note that the HIV/AIDS column is essentially the death rate due to HIV/AIDS.

sns.scatterplot(data = life_expectancy_df, x = ' HIV/AIDS', y = 'Life expectancy ')

We can also notice an inverse relationship between the two: as the number of deaths due to HIV/AIDS grows, the life expectancy of that country’s population tends to decrease. Next, I’m going to use a Seaborn heat map of the correlation matrix.

plt.figure(figsize = (20, 20))
corr_matrix = life_expectancy_df.corr(numeric_only = True)  # skip the text columns (Country, Status)
sns.heatmap(corr_matrix, annot = True)
plt.show()

So, for example, if I look at years of schooling and life expectancy, you should see about 0.75, showing a strong positive correlation between the two: as the number of years spent in school grows, you can anticipate life expectancy rising as well. GDP, at about 0.46, is moderately positively correlated. The adult mortality rate, at about −0.7, demonstrates an inverse relationship between adult mortality and life expectancy. And HIV/AIDS, as we saw when we plotted the scatter plot, is also inversely correlated with life expectancy.
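To read these numbers off without squinting at the heat map, you can also sort the life expectancy column of the correlation matrix (reusing corr_matrix from above):

print(corr_matrix['Life expectancy '].sort_values(ascending = False))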

Now let’s check how many null values we have in the data.

null_counts = life_expectancy_df.isnull().sum()
null_counts[null_counts != 0]

Output:-

Life expectancy 10
Adult Mortality 10
Alcohol 194
Hepatitis B 553
BMI 34
Polio 19
Total expenditure 226
Diphtheria 19
GDP 448
Population 652
thinness 1-19 years 34
thinness 5-9 years 34
Income composition of resources 167
Schooling 163
dtype: int64

It appears that a lot of values are missing from the Population column, and the Hepatitis B and GDP columns also have many missing entries. So, what I’ll do next is fill in the missing values with the mean of each numeric column (taking a mean of the text columns Country and Status would fail, so we restrict the imputation to numeric columns).

numeric_cols = life_expectancy_df.select_dtypes(include = np.number).columns  # skip Country and Status
life_expectancy_df[numeric_cols] = life_expectancy_df[numeric_cols].fillna(life_expectancy_df[numeric_cols].mean())
null_counts = life_expectancy_df.isnull().sum()
null_counts[null_counts != 0]

After you run the cell, all the missing values are filled in. And if you re-check for missing values, you should see that there are none left, which is precisely what we want.

STEP 4: CREATE TRAINING AND TESTING DATASETS

In this step, we are going to create the training and testing data sets.

What I’m going to do first is take the whole data set, except for life expectancy, which is my target output, and call it X. The output, lowercase y, will be life expectancy. Since Country is a text label and Status is categorical, I drop Country and encode Status as 0/1 so that every feature can later be cast to float.

# Country is a text label and Status is 'Developed'/'Developing'; drop Country and
# encode Status as 0/1 so every input column can later be cast to float.
X = life_expectancy_df.drop(columns = ['Life expectancy ', 'Country'])
X['Status'] = (X['Status'] == 'Developed').astype(int)
y = life_expectancy_df[['Life expectancy ']]
X

In X I have the year, the adult mortality rate, infant deaths, and so on. All I wanted to do was take all of the other columns, with the exception of life expectancy, which is my target column, and utilize them as inputs.
As I previously stated, because we are attempting to forecast a continuous variable, we have a regression-type problem. If we wished to forecast categorical outputs, we would have a classification problem instead.

y

Next, I’m going to convert our data to float32: I take a NumPy array of X and set the data type to float32, and do the same for y.

X = np.array(X).astype('float32')
y = np.array(y).astype('float32')

If you look at what’s in X now, you’ll see the exact same data that I had before, but in the form of a NumPy array: I transformed it from a data frame into an array.

What I’ll do next is divide my data into training and testing sets. When we train a machine learning model, we take the complete data set and divide it, roughly 80% for training the model and 20% for testing it. We use the training set to train the model to map the inputs to the output. Then, once the model has been trained, we evaluate its performance on data it has never seen during training: the testing set. I’m going to call train_test_split, pass it X and y, and specify the test size.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

What I’ll do after that is scale our data, and I’m going to do the same for the output, so that both the inputs and the target are standardized.

from sklearn.preprocessing import StandardScaler

scaler_X = StandardScaler()
X_train = scaler_X.fit_transform(X_train)
X_test = scaler_X.transform(X_test)

scaler_y = StandardScaler()
y_train = scaler_y.fit_transform(y_train)
y_test = scaler_y.transform(y_test)
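As a quick sanity check (illustrative), every standardized training column should now have a mean close to 0 and a standard deviation close to 1:

print(X_train.mean(axis = 0).round(2))  # ~0 for every column
print(X_train.std(axis = 0).round(2))   # ~1 for every column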

STEP 5: TRAIN A LINEAR REGRESSION MODEL IN SCIKIT-LEARN

I’m importing the LinearRegression class; the evaluation metrics will be imported in Step 6.

from sklearn.linear_model import LinearRegression

regression_model_sklearn = LinearRegression(fit_intercept = True)
regression_model_sklearn.fit(X_train, y_train)

If you set fit_intercept = True, you allow the model to learn an intercept value b. If, on the other hand, you set fit_intercept = False, you force the line to pass through the origin.
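To see the effect, here is a quick illustrative comparison. Note that because we standardized y to mean zero, the gap will be small on this data; on an unscaled target it would be dramatic:

no_intercept_model = LinearRegression(fit_intercept = False)  # line forced through the origin
no_intercept_model.fit(X_train, y_train)
print(no_intercept_model.score(X_test, y_test))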
Our model has been trained, and now I can examine its performance by scoring it on the test data.

regression_model_sklearn_accuracy = regression_model_sklearn.score(X_test, y_test)
regression_model_sklearn_accuracy

R² score = 0.8009968895850262

I got a score of around 0.80. For regression, score returns R², so this means the model explains roughly 80% of the variance in the test targets. Finally, I can print out the coefficients, which are essentially the m and b values.

print('Linear Model Coefficients (m): ', regression_model_sklearn.coef_)
print('Linear Model Intercept (b): ', regression_model_sklearn.intercept_)
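Raw coefficient arrays are hard to read. One illustrative way to label them is to zip them with the column names we used to build X back in Step 4 (this assumes the column order was preserved when we converted X to a NumPy array):

feature_names = life_expectancy_df.drop(columns = ['Life expectancy ', 'Country']).columns
for name, coef in zip(feature_names, regression_model_sklearn.coef_[0]):
    print(name, round(float(coef), 3))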

STEP 6: EVALUATE THE TRAINED MODEL PERFORMANCE

Now I want to assess the model’s performance. If I apply the predict method to the trained model and pass it my testing data, I get the model’s predictions.

y_predict = regression_model_sklearn.predict(X_test)
y_predict

These are the model’s predictions on the testing data. As I mentioned before, the testing data is data the model has never seen during training, which is crucial. Next, I’ll plot the scaled predictions against the scaled true values.

plt.plot(y_test, y_predict, "+", color = 'r')
plt.xlabel('True Values')          # y_test is plotted on the x-axis
plt.ylabel('Model Predictions')    # y_predict is plotted on the y-axis

With the true values on the x-axis and the model predictions on the y-axis, a perfectly accurate model would place every point on a straight 45-degree line through the plot.
Please keep in mind that the values you see on both axes here are still scaled.
What if I wanted to view the original values, such as life expectancy values?

y_predict_orig = scaler_y.inverse_transform(y_predict)
y_test_orig = scaler_y.inverse_transform(y_test)

I can now do an inverse transform to scale my data back to its original units; this is simply the inverse operation of the scaling we applied back in Step 4. Instead of the scaler’s forward transform, I apply it in reverse to return to my original values.

plt.plot(y_test_orig, y_predict_orig, "+", color = 'r')
plt.xlabel('True Values (years)')
plt.ylabel('Model Predictions (years)')

Plotting the model’s predictions against the actual values in their original units shows the same pattern, indicating that the model performs well.
Now I’m going to compute the mean squared error of y_test_orig and y_predict_orig, take its square root to get the root mean squared error, and also compute the mean absolute error.

# Compute the evaluation metrics (KPIs)

from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

k = X_test.shape[1]  # number of features
n = len(X_test)      # number of test samples
MSE = mean_squared_error(y_test_orig, y_predict_orig)
RMSE = np.sqrt(MSE)
MAE = mean_absolute_error(y_test_orig, y_predict_orig)
r2 = r2_score(y_test_orig, y_predict_orig)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print('RMSE =', RMSE, '\nMSE =', MSE, '\nMAE =', MAE, '\nR2 =', r2, '\nAdjusted R2 =', adj_r2)

So, as you can see, the RMSE is about 4.3, the MSE is around 18.4, the MAE is about 3.12, and R² is approximately 0.80.
An R² of about 0.80 is okay, but we could improve on it, perhaps pushing it toward 0.90, by using a more advanced regressor such as an XGBoost regressor. When I say R² equals 0.8, I mean that 80% of the variation in the life expectancy output is accounted for by variation in the inputs, which is a good sign. And, of course, I can compute the adjusted R², which comes out just slightly below the plain R² here. And that’s simply how you assess the performance of the model.
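To tie everything together, here is a minimal illustrative sketch that scores one held-out row end to end, from scaled features back to years:

sample = X_test[:1]                                   # one already-scaled test row
pred_scaled = regression_model_sklearn.predict(sample)
pred_years = scaler_y.inverse_transform(pred_scaled)  # undo the target scaling
print('Predicted life expectancy:', round(float(pred_years[0, 0]), 1), 'years')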

Conclusion/Why we used Linear Regression

Let’s go ahead and understand the theory and intuition behind linear regression.

So we took the data and split it into inputs and outputs, X and y.
Then we further split that data into training and testing sets.

In simple linear regression, we attempt to predict the value of one variable, y, based on another variable, x. In general, we refer to x as the independent variable and y as the dependent variable.
As an example, the independent variable on the x-axis might be the outside air temperature, and the dependent variable on the y-axis might be sales. If you operate a business, perhaps a bike rental shop, you can try to estimate sales based on temperature: as the temperature rises, people may ride their bikes more, so sales tend to rise.

We call it linear because the dependent variable increases or decreases in a linear way as the independent variable changes.
So, when I say I want to build a simple linear regression model, what I mean is that I want to discover the equation of that straight line, y = m·x + b: I’m attempting to find the best values for m and b in order to minimize the error.
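As a toy sketch with made-up numbers, np.polyfit recovers exactly those m and b values by least squares:

temperature = np.array([10, 15, 20, 25, 30], dtype = float)  # hypothetical temperatures (x)
sales = np.array([20, 34, 41, 58, 66], dtype = float)        # hypothetical sales (y)
m, b = np.polyfit(temperature, sales, deg = 1)               # least-squares fit of y = m*x + b
print('m =', round(m, 2), 'b =', round(b, 2))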

We used multiple linear regression here, since we have many independent variables, and it works nicely for the case at hand. In multiple linear regression, we simply have many independent variables and are looking for one coefficient per variable, plus an intercept.
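In code terms, as an illustrative check rather than part of the original notebook, the trained model’s prediction is just the matrix product of the inputs with those coefficients, plus the intercept:

manual_pred = X_test @ regression_model_sklearn.coef_.T + regression_model_sklearn.intercept_
print(np.allclose(manual_pred, regression_model_sklearn.predict(X_test)))  # True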

I hope you guys enjoyed this project and see you in future projects.

Here, I am signing off with my predictions.

Keep Following for more Content.

Regards

Kartik Aggarwal

Do refer to my profiles for further content!

Github : KartikAggarwal1305 (Kartik Aggarwal) (github.com)

Tableau : Profile — kartik.aggarwal6547 | Tableau Public

Linkedin : Kartik Aggarwal | LinkedIn

Medium : Kartik Aggarwal — Medium


Kartik Aggarwal

I am a passionate data and business analyst from Christ University. I am well certified in concepts such as Excel, Power BI, Tableau, SQL and Python.