eduardotleite · 2 years
ANOVA Analysis
This is the Week 1 assignment for the Data Analysis Tools course on the Coursera platform. The challenge is to run an Analysis of Variance (ANOVA). This type of analysis assesses whether the means of two or more groups are statistically different from each other. It is used whenever you want to compare the means of a quantitative variable across the groups of a categorical variable. The null hypothesis is that there is no difference in the mean of the quantitative variable across groups, while the alternative is that there is a difference.
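To make the idea concrete, here is a minimal sketch of the F statistic behind a one-way ANOVA, written in plain Python. The function name `one_way_f` and the toy numbers are made up for illustration; they are not part of the course material or the Gapminder data. F is the ratio of between-group variability to within-group variability, so a large F is evidence against the null hypothesis.

```python
from statistics import mean


def one_way_f(groups):
    """Return the one-way ANOVA F statistic for a list of groups of numbers."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k = len(groups)          # number of groups
    n = len(all_values)      # total number of observations

    # Between-group sum of squares: how far each group mean is from the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of the observations around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within


# Toy data (hypothetical): three groups with clearly different means
different = [[1, 2, 3], [11, 12, 13], [21, 22, 23]]
# Toy data (hypothetical): three groups with identical means
same = [[1, 2, 3], [1, 2, 3], [1, 2, 3]]

f_different = one_way_f(different)
f_same = one_way_f(same)
print(f_different, f_same)
```

The first set of groups gives a very large F, the second gives an F of zero because all group means coincide. In the real analysis the F statistic and its p-value come from the statsmodels output rather than from a hand-written function.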
Dataset Used – Gapminder
Gapminder identifies systematic misconceptions about important global trends and proportions and uses reliable data to develop easy-to-understand teaching materials to rid people of their misconceptions. Gapminder is an independent Swedish foundation with no political, religious, or economic affiliations. You should visit it: https://www.gapminder.org/.
The dataset used has 16 variables and 213 rows. I chose to analyze income per person (incomeperperson) and life expectancy (lifeexpectancy).
And what is the question?
Is life expectancy different among the five categories of income per person (A, B, C, D, E)?
Since income per person is a quantitative variable, I transformed it into a categorical variable, using the parameters suggested by IBGE to classify social class according to income. For the parameters, I analyzed the boxplot posted below.
The data in the image is in Portuguese, because IBGE is a Brazilian institute.
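The class boundaries used in the code further down (1000, 3000, 5000 and 15000) can also be written as a small lookup. This is only a sketch using the standard library. The helper name `income_class` is hypothetical, and it assumes each boundary value belongs to the lower class, which matches the strict `>` comparisons in `income_categories`. pandas also offers `pd.cut` for this kind of binning.

```python
from bisect import bisect_left

# Upper bounds of classes E, D, C and B; anything above the last bound is class A
BOUNDS = [1000, 3000, 5000, 15000]
LABELS = ["E", "D", "C", "B", "A"]


def income_class(income):
    """Map an income per person to a class label from E (lowest) to A (highest)."""
    # bisect_left counts how many bounds are strictly below the income
    return LABELS[bisect_left(BOUNDS, income)]


# A few hypothetical incomes, including values that sit exactly on a boundary
for value in (800, 1000, 2500, 5000.01, 15000, 40000):
    print(value, income_class(value))
```

Keeping the boundaries in one list makes it easy to try a different classification without rewriting a chain of `if`/`elif` branches.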
The Code
I used Anaconda to code in Python for this task. The code is posted below.
import numpy
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
import matplotlib.pyplot as plt
import seaborn as sns
import researchpy as rp
import pycountry_convert as pc
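# Load the Gapminder data and keep only the two variables of interest
df = pd.read_csv('gapminder.csv')
df = df[['lifeexpectancy', 'incomeperperson']]

# Convert both columns to numeric; non-numeric entries become NaN
df['lifeexpectancy'] = pd.to_numeric(df['lifeexpectancy'], errors='coerce')
df['incomeperperson'] = pd.to_numeric(df['incomeperperson'], errors='coerce')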
# Map income per person to a social class, from A (highest) to E (lowest)
def income_categories(row):
    if row["incomeperperson"] > 15000:
        return "A"
    elif row["incomeperperson"] > 5000:
        return "B"
    elif row["incomeperperson"] > 3000:
        return "C"
    elif row["incomeperperson"] > 1000:
        return "D"
    else:
        return "E"
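# Keep only plausible values (rows with NaN fail these comparisons and are dropped too).
# .copy() avoids a SettingWithCopyWarning when the new column is added below.
df = df[(df['lifeexpectancy'] >= 1) & (df['lifeexpectancy'] <= 120) & (df['incomeperperson'] > 0)].copy()

# Assign each country to an income class
df["Income_category"] = df.apply(income_categories, axis=1)

df = df[["Income_category", "incomeperperson", "lifeexpectancy"]].dropna()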
# Summary statistics for life expectancy
print(rp.summary_cont(df['lifeexpectancy']))

# Boxplot of life expectancy, one box per income class
classes = ['A', 'B', 'C', 'D', 'E']
fig1, ax1 = plt.subplots()
df_new = [df[df['Income_category'] == c]['lifeexpectancy'] for c in classes]
ax1.set_title('life expectancy')
ax1.boxplot(df_new)
ax1.set_xticklabels(classes)  # label each box with its income class
plt.show()
# One-way ANOVA: life expectancy explained by income class
results = smf.ols('lifeexpectancy ~ C(Income_category)', data=df).fit()
print(results.summary())

# Post hoc pairwise comparisons with Tukey's HSD
print("Tukey")
mc1 = multi.MultiComparison(df['lifeexpectancy'], df['Income_category'])
print(mc1)
res1 = mc1.tukeyhsd()
print(res1.summary())
# Group means and standard deviations by income class
print('means for life expectancy by Income')
m1 = df.groupby('Income_category').mean()
print(m1)

print('Results')
print('standard deviations for life expectancy by Income')
sd1 = df.groupby('Income_category').std()
print(sd1)
Results – ANOVA Analysis
To answer the question of the task, I ran an ANOVA test. As shown below, of the 176 rows, 171 were used for the test. I used a filter to remove invalid values (non-numeric, negative, etc.), which reduced the number of rows from the original dataset.
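The effect of that kind of filter can be sketched on a handful of made-up rows. The values below are hypothetical and are not taken from gapminder.csv. The helper `to_number` imitates, for a single value, what `pd.to_numeric(..., errors='coerce')` does by turning anything unparseable into NaN, and `is_valid` repeats the conditions used in the analysis code.

```python
import math


def to_number(value):
    """Return value as a float, or NaN if it cannot be parsed."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return math.nan


def is_valid(life, income):
    """Same conditions as the filter in the analysis code; NaN fails every comparison."""
    return 1 <= life <= 120 and income > 0


# Hypothetical raw rows: (lifeexpectancy, incomeperperson) as text read from a CSV
raw_rows = [("75.2", "8000"), (" ", "1200"), ("68.0", ""), ("81.1", "-5"), ("59.3", "450.7")]

clean_rows = [
    (life, income)
    for life, income in ((to_number(a), to_number(b)) for a, b in raw_rows)
    if is_valid(life, income)
]
print(len(raw_rows), "rows read,", len(clean_rows), "rows kept")
```

Rows with a blank or negative value are dropped, which is why fewer rows reach the test than were read from the file.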
The analysis produces a box for each category in the graph above and, as we can see, class A has a life expectancy of 80.39 years, while class E has a life expectancy of 59.15 years.
