coursera2022
Logistic Regression
Assignment: Test a Logistic Regression Model
Following is the Python program I wrote to fulfill the fourth assignment of the Regression Modeling in Practice online course.
I decided to use Jupyter Notebook as it is a convenient way to write code and present results.
Research question for this assignment
For this assignment, I decided to use the NESARC database with the following question: are people of white ethnicity more likely to have ever used cannabis?
The potential other explanatory variables will be:
Age
Sex
Family income
Data management
The data will be managed so that cannabis usage is recoded to 0 (never used cannabis) or 1 (used cannabis). Non-answer records (coded 9) will be discarded.
As the response variable has only two categories, no category grouping is needed.
The other categorical variable (sex) will be recoded so that 0 means female and 1 means male. The two quantitative explanatory variables (age and family income) will be centered.
In [1]:
# Magic command to insert the graph directly in the notebook
%matplotlib inline

# Load useful Python libraries for handling data
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Markdown, display

In [2]:
nesarc = pd.read_csv('nesarc_pds.csv')

C:\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:2723: DtypeWarning: Columns (76) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

In [3]:
canabis_usage = {1: 1, 2: 0, 9: 9}
sex_shift = {1: 1, 2: 0}
white_race = {1: 1, 2: 0}

subnesarc = (nesarc[['AGE', 'SEX', 'S1Q1D5', 'S1Q7D', 'S3BQ1A5', 'S1Q11A']]
             .assign(sex=lambda x: pd.to_numeric(x['SEX'].map(sex_shift)),
                     white_ethnicity=lambda x: pd.to_numeric(x['S1Q1D5'].map(white_race)),
                     used_canabis=lambda x: (pd.to_numeric(x['S3BQ1A5'], errors='coerce')
                                             .map(canabis_usage)
                                             .replace(9, np.nan)),
                     family_income=lambda x: pd.to_numeric(x['S1Q11A'], errors='coerce'))
             .dropna())
centered_nesarc = subnesarc.assign(age_c=subnesarc['AGE'] - subnesarc['AGE'].mean(),
                                   family_income_c=subnesarc['family_income'] - subnesarc['family_income'].mean())

In [4]:
display(Markdown("Mean age : {:.0f}".format(centered_nesarc['AGE'].mean())))
display(Markdown("Mean family income last year: {:.0f}$".format(centered_nesarc['family_income'].mean())))
Mean age : 46
Mean family income last year: 45631$
Let's check that the quantitative variables are effectively centered.
In [5]:print("Centered age") print(centered_nesarc['age_c'].describe()) print("\nCentered family income") print(centered_nesarc['family_income_c'].describe()) Centered age count 4.272500e+04 mean -2.667486e-13 std 1.819181e+01 min -2.841439e+01 25% -1.441439e+01 50% -2.414394e+00 75% 1.258561e+01 max 5.158561e+01 Name: age_c, dtype: float64 Centered family income count 4.272500e+04 mean -5.710829e-10 std 5.777221e+04 min -4.560694e+04 25% -2.863094e+04 50% -1.263094e+04 75% 1.436906e+04 max 2.954369e+06 Name: family_income_c, dtype: float64
The means are both very close to 0, confirming the centering.
Distributions visualization
The following plots show the relationship between the response variable and each of the explanatory variables.
In [6]:
g = sns.factorplot(x='white_ethnicity', y='used_canabis', data=centered_nesarc,
                   kind="bar", ci=None)
g.set_xticklabels(['Non White', 'White'])
plt.xlabel('White ethnicity')
plt.ylabel('Ever used cannabis')
plt.title('Ever used cannabis dependence on the white ethnicity');

In [7]:
g = sns.factorplot(x='sex', y='used_canabis', data=centered_nesarc, kind="bar", ci=None)
g.set_xticklabels(['Female', 'Male'])
plt.ylabel('Ever used cannabis')
plt.title('Ever used cannabis dependence on the sex');

In [8]:
g = sns.boxplot(x='used_canabis', y='family_income', data=centered_nesarc)
g.set_yscale('log')
g.set_xticklabels(('No', 'Yes'))
plt.xlabel('Ever used cannabis')
plt.ylabel('Family income ($)');

In [9]:
g = sns.boxplot(x='used_canabis', y='AGE', data=centered_nesarc)
g.set_xticklabels(('No', 'Yes'))
plt.xlabel('Ever used cannabis')
plt.ylabel('Age');
The four plots above show the following trends:
White people try cannabis more than non-white people
Males try cannabis more than females
Younger people try cannabis more than older ones
People from richer families try cannabis more than those from poorer families
Logistic regression model
The plots showed the direction of a potential relationship. But a rigorous statistical test has to be carried out to confirm the four previous hypotheses.
The following code will test a logistic regression model on our hypothesis.
In [10]:
model = smf.logit(formula='used_canabis ~ family_income_c + age_c + sex + white_ethnicity',
                  data=centered_nesarc).fit()
model.summary()

Optimization terminated successfully.
         Current function value: 0.451313
         Iterations 6

Out[10]:
Logit Regression Results
Dep. Variable:  used_canabis      No. Observations:  42725
Model:          Logit             Df Residuals:      42720
Method:         MLE               Df Model:          4
Date:           Sun, 24 Jul 2016  Pseudo R-squ.:     0.07529
Time:           16:15:55          Log-Likelihood:    -19282.
converged:      True              LL-Null:           -20852.
                                  LLR p-value:       0.000

                      coef   std err        z   P>|z|  [95.0% Conf. Int.]
Intercept          -2.1043     0.032  -66.764   0.000    -2.166    -2.043
family_income_c  2.353e-06  2.16e-07   10.880   0.000  1.93e-06  2.78e-06
age_c              -0.0378     0.001  -45.288   0.000    -0.039    -0.036
sex                 0.5060     0.026   19.766   0.000     0.456     0.556
white_ethnicity     0.3583     0.032   11.268   0.000     0.296     0.421
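As a side note (a minimal sketch of my own, not part of the original notebook), the intercept can be converted into a baseline probability: with every predictor at 0, i.e. a non-white woman of mean age and mean family income, the log-odds reduce to the intercept reported above.

import numpy as np

intercept = -2.1043                        # from the Logit summary table above
p_baseline = 1 / (1 + np.exp(-intercept))  # inverse logit
print("Baseline probability of ever having used cannabis: {:.3f}".format(p_baseline))
# prints 0.109, i.e. roughly an 11% chance for the reference individual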
In [11]:
params = model.params
conf = model.conf_int()
conf['Odds Ratios'] = params
conf.columns = ['Lower Conf. Int.', 'Upper Conf. Int.', 'Odds Ratios']
np.exp(conf)

Out[11]:
                 Lower Conf. Int.  Upper Conf. Int.  Odds Ratios
Intercept                0.114625          0.129699     0.121930
family_income_c          1.000002          1.000003     1.000002
age_c                    0.961303          0.964455     0.962878
sex                      1.577421          1.743910     1.658578
white_ethnicity          1.344412          1.522873     1.430863
Confounders analysis
As all four variable coefficients have significant p-values (<< 0.05), none of the tested variables acts as a confounder in this model.
But as the pseudo R-squared is quite low, the model does not explain the response variable well. There may therefore be a confounding variable that I have not tested for.
Summary
From the odds ratios results, we can conclude that:
People with white ethnicity are more likely to have ever used cannabis (OR=1.43, 95% confidence int. [1.34, 1.52], p<.0005)
So the results support the hypothesized relationship between our primary explanatory variable (white ethnicity) and the response variable (ever used cannabis)
Males are more likely to have ever used cannabis than females (OR=1.66, 95% CI=[1.58, 1.74], p<.0005)
Younger people are more likely to have ever used cannabis: each additional year of age multiplies the odds by 0.963 (OR=0.963, 95% CI=[0.961, 0.964], p<.0005)
Regarding the last explanatory variable (family income), I am not sure I can really conclude. Strictly from the results, people coming from richer families are more likely to have ever used cannabis (OR=1.000002, 95% CI=[1.000002, 1.000003], p<.0005). But the odds ratio is so close to 1.0 mainly because it is expressed per single dollar of income, which makes the effect size hard to judge.
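To make that last point concrete, here is a minimal sketch (not part of the original notebook) that rescales the per-dollar coefficient to a $10,000 step, which is much easier to interpret:

import numpy as np

coef_per_dollar = 2.353e-06   # family_income_c coefficient from the summary above
or_per_10000 = np.exp(coef_per_dollar * 10000)
print("Odds ratio per additional $10,000: {:.3f}".format(or_per_10000))
# prints 1.024: about a 2.4% increase in the odds per extra $10,000 of family income

Seen this way, the income effect is statistically significant but modest in size.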
Multiple Regression
Assignment: Test a Multiple Regression Model
Following is the Python program I wrote to fulfill the third assignment of the Regression Modeling in Practice online course.
I decided to use Jupyter Notebook as it is a convenient way to write code and present results.
Assignment research question
Using the Gapminder database, I would like to see if there is a relationship between the income per person (explanatory variable) and the residential consumption of electricity (response variable).
The following variables will be tested also to improve the prediction model and figure out potential confounders:
Employment rate (total employees aged 15+)
Oil consumption per person (tonnes per year per person)
Urban rate (Urban population in %)
Data management
For the question I'm interested in, the countries for which data are missing will be discarded. As missing data in the Gapminder database are already coded as NaN, no special data treatment is needed.
In [2]:
# Magic command to insert the graph directly in the notebook
%matplotlib inline

# Load useful Python libraries for handling data
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Markdown, display

In [3]:
# Read the data
data_filename = r'gapminder.csv'
data = pd.read_csv(data_filename, low_memory=False)
data = data.set_index('country')
General information on the Gapminder data
In [4]:
display(Markdown("Number of countries: {}".format(len(data))))
Number of countries: 213
Data management
In order to carry out the regression analysis, we need to center the potential explanatory variables.
In [5]:
subdata = (data[['incomeperperson', 'relectricperperson', 'employrate', 'oilperperson', 'urbanrate']]
           .assign(income=lambda x: pd.to_numeric(x['incomeperperson'], errors='coerce'),
                   electricity=lambda x: pd.to_numeric(x['relectricperperson'], errors='coerce'),
                   employ=lambda x: pd.to_numeric(x['employrate'], errors='coerce'),
                   oil=lambda x: pd.to_numeric(x['oilperperson'], errors='coerce'),
                   urban=lambda x: pd.to_numeric(x['urbanrate'], errors='coerce'))
           .dropna())
display(Markdown("Number of countries after discarding countries with missing data: {}".format(len(subdata))))

centered_data = (subdata.assign(income_c=lambda x: x['income'] - subdata['income'].mean(),
                                employ_c=lambda x: x['employ'] - subdata['employ'].mean(),
                                oil_c=lambda x: x['oil'] - subdata['oil'].mean(),
                                urban_c=lambda x: x['urban'] - subdata['urban'].mean()))
Number of countries after discarding countries with missing data: 61
Let's check that the variables are properly centered.
In [6]:
display(Markdown("Mean of income per person after centering: {:3g}".format(centered_data['income_c'].mean())))
display(Markdown("Mean of employment rate after centering: {:3g}".format(centered_data['employ_c'].mean())))
display(Markdown("Mean of oil consumption per person after centering: {:3g}".format(centered_data['oil_c'].mean())))
display(Markdown("Mean of urban rate after centering: {:3g}".format(centered_data['urban_c'].mean())))
Mean of income per person after centering: -1.77426e-12
Mean of employment rate after centering: -2.56261e-15
Mean of oil consumption per person after centering: 5.49651e-16
Mean of urban rate after centering: 1.9569e-14
Bivariate distribution
Before looking at the multiple regression analysis, a polynomial regression will be applied to the data to see if it fits the results better than a straight line.
In [7]:
sns.regplot(x='income_c', y='electricity', data=centered_data)
sns.regplot(x='income_c', y='electricity', order=2, data=centered_data)
plt.xlabel('Centered income per person (2000 US$)')
plt.ylabel('Residential electricity consumption (kWh)')
plt.title('Scatterplot for the association between the income and the residential electricity consumption');
OLS regression model
Test the linear regression model.
In [8]:
reg1 = smf.ols('electricity ~ income_c', data=centered_data).fit()
reg1.summary()

Out[8]:
OLS Regression Results
Dep. Variable:    electricity       R-squared:           0.402
Model:            OLS               Adj. R-squared:      0.392
Method:           Least Squares     F-statistic:         39.69
Date:             Sun, 17 Jul 2016  Prob (F-statistic):  4.08e-08
Time:             19:02:59          Log-Likelihood:      -530.88
No. Observations: 61                AIC:                 1066.
Df Residuals:     59                BIC:                 1070.
Df Model:         1
Covariance Type:  nonrobust

                coef  std err      t  P>|t|  [95.0% Conf. Int.]
Intercept  1626.2603  189.670  8.574  0.000  1246.731  2005.790
income_c      0.0950    0.015  6.300  0.000     0.065     0.125

Omnibus:        80.246  Durbin-Watson:    2.135
Prob(Omnibus):   0.000  Jarque-Bera (JB): 1091.497
Skew:            3.660  Prob(JB):         9.65e-238
Kurtosis:       22.387  Cond. No.         1.26e+04
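To read the fitted line concretely, here is a minimal sketch (my own illustration, using the coefficients from the summary above):

intercept, slope = 1626.2603, 0.0950  # coefficients from the OLS summary above
income_c = 10000                      # a country 10,000 US$ above the mean income
predicted_kwh = intercept + slope * income_c
print("Predicted residential consumption: {:.0f} kWh".format(predicted_kwh))
# prints 2576 kWh, versus about 1626 kWh for a country at the mean income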
Test a second order polynomial regression model.
In [9]:
reg2 = smf.ols('electricity ~ income_c + I(income_c**2)', data=centered_data).fit()
reg2.summary()

Out[9]:
OLS Regression Results
Dep. Variable:    electricity       R-squared:           0.413
Model:            OLS               Adj. R-squared:      0.393
Method:           Least Squares     F-statistic:         20.43
Date:             Sun, 17 Jul 2016  Prob (F-statistic):  1.92e-07
Time:             19:02:59          Log-Likelihood:      -530.31
No. Observations: 61                AIC:                 1067.
Df Residuals:     58                BIC:                 1073.
Df Model:         2
Covariance Type:  nonrobust

                        coef   std err       t  P>|t|  [95.0% Conf. Int.]
Intercept          1904.3071   325.761   5.846  0.000  1252.225  2556.389
income_c              0.1117     0.022   5.095  0.000     0.068     0.156
I(income_c ** 2)  -1.758e-06  1.68e-06  -1.049  0.298  -5.11e-06  1.6e-06

Omnibus:        76.908  Durbin-Watson:    2.098
Prob(Omnibus):   0.000  Jarque-Bera (JB): 912.307
Skew:            3.504  Prob(JB):         7.85e-199
Kurtosis:       20.602  Cond. No.         3.92e+08
From the second OLS analysis, we can see that the coefficient corresponding to the second order term has a p-value of 0.3 > 0.05. Therefore we should keep only the linear term for our primary relation.
Multiple regression
The multiple regression will now be carried out.
In [10]:
reg3 = smf.ols('electricity ~ income_c + oil_c + employ_c + urban_c', data=centered_data).fit()
reg3.summary()

Out[10]:
OLS Regression Results
Dep. Variable:    electricity       R-squared:           0.452
Model:            OLS               Adj. R-squared:      0.412
Method:           Least Squares     F-statistic:         11.53
Date:             Sun, 17 Jul 2016  Prob (F-statistic):  6.76e-07
Time:             19:03:00          Log-Likelihood:      -528.25
No. Observations: 61                AIC:                 1067.
Df Residuals:     56                BIC:                 1077.
Df Model:         4
Covariance Type:  nonrobust

                coef  std err      t  P>|t|  [95.0% Conf. Int.]
Intercept  1626.2603  186.470  8.721  0.000  1252.716  1999.804
income_c      0.0758    0.020  3.883  0.000     0.037     0.115
oil_c       123.7307  135.564  0.913  0.365  -147.837   395.298
employ_c     48.6765   25.901  1.879  0.065    -3.209   100.562
urban_c       2.3571   13.864  0.170  0.866   -25.415    30.129

Omnibus:        62.530  Durbin-Watson:    2.146
Prob(Omnibus):   0.000  Jarque-Bera (JB): 537.653
Skew:            2.704  Prob(JB):         1.78e-117
Kurtosis:       16.502  Cond. No.         1.26e+04
Unexpectedly, all explanatory variables except the primary one (income per person) should be discarded, as their coefficients have p-values higher than 0.05.
Finally, we will look at the diagnostic graphs.
Q-Q plot for normality
In [11]:
fig = sm.qqplot(reg1.resid, line='r')
The residuals do not follow the line well, especially in the upper half of the distribution. As our model is a simple linear regression between residential electricity consumption and income per person, this means the model is a poor predictor for countries with larger incomes.
Residuals plot
In [12]:
stdres = pd.DataFrame(reg1.resid_pearson)
plt.plot(stdres, 'o')
plt.axhline(y=0, color='r')
plt.ylabel('Standardized Residual')
plt.xlabel('Observation Number');
The previous plot highlights only one clear extreme outlier. So this confirms that the model is reasonable, although it could be improved.
Partial regression plots
In [13]:
fig = plt.figure(figsize=(12, 8))
sm.graphics.plot_regress_exog(reg3, 'urban_c', fig);

In [19]:
fig = plt.figure(figsize=(12, 8))
sm.graphics.plot_regress_exog(reg3, 'oil_c', fig);

In [20]:
fig = plt.figure(figsize=(12, 8))
sm.graphics.plot_regress_exog(reg3, 'employ_c', fig);
The partial regression plots above are shown for the sake of the assignment, as all variables but the income per person are non-significant in the multiple regression model. They all show that some extreme outliers are present.
The partial plot against the urban rate has a horizontal partial regression line. This confirms that urban rate cannot improve the model.
Leverage plot
In [14]:
sm.graphics.influence_plot(reg1);
The leverage plot above shows that our extreme outlier, the United Arab Emirates, does not have a strong influence on the estimation of the model coefficients. On the contrary, Norway, which has the second highest residual, has an important influence on the model estimation.
Analysis of trouble case
To conclude this assignment, I would like to look at the same question but taking the oil consumption as the primary explanatory variable.
Let's first see if a second-order polynomial fits better than the linear regression model.
In [15]:
sns.regplot(x='oil_c', y='electricity', data=centered_data);
sns.regplot(x='oil_c', y='electricity', order=2, data=centered_data)
plt.xlabel('Centered oil consumption per person (tonnes)')
plt.ylabel('Residential electricity consumption (kWh)')
plt.title('Scatterplot for the association between the oil and the residential electricity consumption');

In [16]:
reg_oil1 = smf.ols('electricity ~ oil_c', data=centered_data).fit()
reg_oil1.summary()
Out[16]:
OLS Regression Results
Dep. Variable:    electricity       R-squared:           0.199
Model:            OLS               Adj. R-squared:      0.185
Method:           Least Squares     F-statistic:         14.64
Date:             Sun, 17 Jul 2016  Prob (F-statistic):  0.000317
Time:             19:03:39          Log-Likelihood:      -539.82
No. Observations: 61                AIC:                 1084.
Df Residuals:     59                BIC:                 1088.
Df Model:         1
Covariance Type:  nonrobust

                coef  std err      t  P>|t|  [95.0% Conf. Int.]
Intercept  1626.2603  219.583  7.406  0.000  1186.877  2065.644
oil_c       487.8025  127.505  3.826  0.000   232.666   742.939

Omnibus:        47.149  Durbin-Watson:    1.867
Prob(Omnibus):   0.000  Jarque-Bera (JB): 273.869
Skew:            1.975  Prob(JB):         3.39e-60
Kurtosis:       12.599  Cond. No.         1.72
In [17]:
reg_oil2 = smf.ols('electricity ~ oil_c + I(oil_c**2)', data=centered_data).fit()
reg_oil2.summary()
Out[17]:
OLS Regression Results
Dep. Variable:    electricity       R-squared:           0.595
Model:            OLS               Adj. R-squared:      0.581
Method:           Least Squares     F-statistic:         42.58
Date:             Sun, 17 Jul 2016  Prob (F-statistic):  4.18e-12
Time:             19:03:39          Log-Likelihood:      -519.02
No. Observations: 61                AIC:                 1044.
Df Residuals:     58                BIC:                 1050.
Df Model:         2
Covariance Type:  nonrobust

                     coef  std err       t  P>|t|  [95.0% Conf. Int.]
Intercept       2079.9567  168.620  12.335  0.000  1742.428  2417.486
oil_c           1617.3225  175.683   9.206  0.000  1265.655  1968.990
I(oil_c ** 2)   -152.9758   20.316  -7.530  0.000  -193.643  -112.309

Omnibus:        46.176  Durbin-Watson:    1.686
Prob(Omnibus):   0.000  Jarque-Bera (JB): 230.695
Skew:            2.004  Prob(JB):         8.04e-51
Kurtosis:       11.643  Cond. No.         19.1
From the OLS analysis, the second-order regression fits the results better. But the outlier far on the right seems to deteriorate the accuracy of the linear regression coefficient.
This is confirmed by the leverage plot below. Singapore has a strong influence on the regression model. This country is a singularity, having a huge oil consumption but a reasonable residential electricity consumption.
In [18]:sm.graphics.influence_plot(reg_oil1);
Anyway, the multiple regression shows that oil consumption is not a significant explanatory variable of the residential electricity consumption. Indeed, in this case income per person is a confounder; it is the real explanatory variable.
Conclusion
In this assignment we have seen that only our primary explanatory variable (income per person) is a good one to build a regression model on. However, with an R-squared value of 0.4, 60% of the variation in electricity consumption is still not explained by the model; in particular, the model performs poorly for countries with higher incomes.
In the latter part, we used the tools described this week to examine the trouble that can be caused by outlier data.
Linear
Assignment: Test a Basic Linear Regression Model
Following is the Python program I wrote to fulfill the second assignment of the Regression Modeling in Practice online course.
I decided to use Jupyter Notebook as it is a convenient way to write code and present results.
Assignment research question
Using the Gapminder database, I would like to see if there is a linear relationship between the income per person (explanatory variable) and the residential consumption of electricity (response variable).
Data management
For the question I'm interested in, the countries for which data are missing will be discarded. As missing data in the Gapminder database are already coded as NaN, no special data treatment is needed.
In [1]:
# Magic command to insert the graph directly in the notebook
%matplotlib inline

# Load useful Python libraries for handling data
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Markdown, display

In [2]:
# Read the data
data_filename = r'gapminder.csv'
data = pd.read_csv(data_filename, low_memory=False)
data = data.set_index('country')
General information on the Gapminder data
In [3]:
display(Markdown("Number of countries: {}".format(len(data))))
display(Markdown("Number of variables: {}".format(len(data.columns))))
Number of countries: 213
Number of variables: 15
Variables distribution
Before computing the linear regression between the income per person and the residential electricity consumption, let's have a look at the distributions of the variables.
In [4]:
subdata2 = (data[['incomeperperson', 'relectricperperson']]
            .assign(income=lambda x: pd.to_numeric(data['incomeperperson'], errors='coerce'),
                    electricity=lambda x: pd.to_numeric(data['relectricperperson'], errors='coerce'))
            .dropna())
Income per person
In [5]:
sns.distplot(subdata2['income'], kde=False)
plt.xlabel('Income per person (2000 US$)');
From the distribution graph, we can see that the distribution is right-skewed and bimodal. There is also a singular case with a high income per capita. That country is:
In [13]:
subdata2.loc[subdata2['income'] > 45000]

Out[13]:
             incomeperperson relectricperperson  electricity        income
country
Luxembourg  52301.5871789984   1566.10613860734  1566.106139  52301.587179

In [15]:
subdata2['income'].describe()

Out[15]:
count      130.000000
mean      8784.531620
std      11420.725637
min        103.775857
25%       1156.551493
50%       3172.679153
75%      11606.578377
max      52301.587179
Name: income, dtype: float64
For this assignment the explanatory variable income per person will have to be centered as its mean is 8784.5.
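A minimal sketch of that centering step (the column name income_c is my choice for illustration; it is not created in this notebook):

# Center the income so that its mean becomes 0; the slope of the later
# regression is unchanged, only the intercept becomes the mean response.
subdata2 = subdata2.assign(income_c=subdata2['income'] - subdata2['income'].mean())
print(subdata2['income_c'].mean())   # ~0, up to floating-point error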
Residential electricity consumption
In [6]:
sns.distplot(subdata2['electricity'], kde=False)
plt.xlabel('Residential electricity consumption (kWh)');

In [16]:
subdata2['electricity'].describe()

Out[16]:
count      130.000000
mean      1148.615676
std       1584.959430
min          0.000000
25%        226.318460
50%        609.335172
75%       1484.702878
max      11154.755033
Name: electricity, dtype: float64
The residential electricity consumption is also right-skewed, and a couple of countries present unusually high values.
Data
Research question
Using the Gapminder database, I would like to see if increasing Internet usage results in an increasing suicide rate. A study shows that other factors, like unemployment, could have a great impact.
So for this assignment, the three following variables will be analyzed:
Internet Usage Rate (per 100 people)
Suicide Rate (per 100 000 people)
Employment Rate (% of the population of age 15+)
As the Gapminder database is an aggregation of data from different sources, I will describe the data separately for the different variables.
About my data
Sample
The sample comes from the Gapminder database (http://www.gapminder.org/). This is a non-profit organization founded in Stockholm to promote sustainable global development in order to achieve the United Nations Millennium Development Goals.
The data are gathered for 215 areas (192 UN members, with Serbia and Montenegro aggregated, plus 24 other areas). But not all indicators are available for all countries.
Although the database provides time series for all indicators on a yearly basis, the sample used for this class uses data from a certain year depending on the indicator:
Internet Usage Rate (per 100 people): data used are for 2010
Suicide per 100 000 people: data used are for 2008
Employment rate for people of age 15+: data used are for 2007
Procedure
The three indicators are collected by different organizations:
Internet Usage Rate (per 100 people):
Data from the World Bank (http://databank.worldbank.org/data/home.aspx)
The data are computed as the weighted average of different sources: International Telecommunication Union, World Telecommunication/ICT Development Report and database, and World Bank estimates.
Suicide per 100 000 people: data used are for 2008
Data from the World Health Organization (WHO) (http://www.who.int/violence_injury_prevention/surveillance/databases/mortality/en/)
The database is built on annual reports by the 120 Member States from their civil registration systems.
Employment rate for people of age 15+: data used are for 2007
Data from the International Labour Organization (ILO) (http://www.ilo.org/emppolicy/lang--en/index.htm)
Every two years since 1999, the ILO has published 18 key indicators of the labour market, including the employment rate.
The data were collected by different methods depending on the country. The precise list is available at http://www.ilo.org/ilostat. Those methods are Population census, Official estimate, Administrative records, Population register, Household surveys, Labour force survey, Household income/expenditure survey or Other household survey.
Measures
All three indicators are constructed mainly from reports by the member states.
The explanatory variable Internet Usage Rate (per 100 people):
Definition: Internet users are defined as individuals who used the Internet in the last 12 months.
Scale: 0 to 100
Management: suppression of countries with no data provided
The response variable Suicide, age adjusted, per 100 000 people:
Definition: Mortality due to self-inflicted injury, per 100 000 standard population, age adjusted. Combination of data from WHO Violence and Injury Prevention (VIP) and from the WHO Global Burden of Disease.
Scale: 0 to 100 000
Management: suppression of countries with no data provided
Another explanatory variable Employment rate for people of age 15+:
Definition: Percentage of the total population, age above 15, that has been employed during the given year.
Scale: 0% to 100%
Management: suppression of countries with no data provided
K Means
I decided to use the same research question as for the previous assignment on Lasso regression analysis.
Using the Gapminder database, I would like to see which variables most influence income per person (2010 GDP per capita in constant 2000 US$). It will therefore be my test variable.
The cluster analysis will be carried out on the following variables:
Residential electricity consumption (per person in kWh)
CO2 emissions (in metric tons)
Employment rate (total employees age 15+ in % of the population)
Internet use rate (Internet users per 100 people)
Life expectancy (Life expectancy at birth in years)
Polity score (Democracy score; the scale ranges from -10 (the lowest) to 10 (the highest))
Urban rate (Urban population in %)
Data management
The countries for which data are missing will be discarded. As missing data in the Gapminder database are already coded as NaN, no special data treatment is needed.
In [1]:
# Magic command to insert the graph directly in the notebook
%matplotlib inline

# Load useful Python libraries for handling data
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Markdown, display
from sklearn.cross_validation import train_test_split
from sklearn import preprocessing
from sklearn.cluster import KMeans

In [2]:
# Read the data
data_filename = r'gapminder.csv'
data = pd.read_csv(data_filename)
data = data.set_index('country')
General information on the Gapminder data
In [3]:
display(Markdown("Number of countries: {}".format(len(data))))
Number of countries: 213
Predictors selections and standardization
In [4]:
explanatory_vars = ['employrate', 'urbanrate', 'polityscore', 'lifeexpectancy',
                    'internetuserate', 'relectricperperson']
test_var = 'incomeperperson'

constructor_dict = dict()
for var in explanatory_vars + [test_var, ]:
    constructor_dict[var] = pd.to_numeric(data[var], errors='coerce')
numeric_data = pd.DataFrame(constructor_dict, index=data.index).dropna()

display(Markdown("Number of countries after discarding countries with missing data: {}".format(len(numeric_data))))
Number of countries after discarding countries with missing data: 122
In [5]:
predictors = numeric_data[explanatory_vars]
target = numeric_data[test_var]

# Standardize predictors to have mean=0 and std=1
std_predictors = predictors.copy()
for var in std_predictors.columns:
    std_predictors[var] = preprocessing.scale(std_predictors[var].astype('float64'))

# Check standardization
std_predictors.describe()
Out[5]:
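The post stops at the standardization step. For completeness, here is a minimal sketch (my assumption, not the original notebook's code) of how the cluster analysis itself could proceed, scanning k with the elbow method on the standardized predictors built above:

from scipy.spatial.distance import cdist

# Run k-means for k = 1..9 and record the average distance of each country
# to its assigned cluster centroid
clusters = range(1, 10)
mean_dist = []
for k in clusters:
    model = KMeans(n_clusters=k).fit(std_predictors)
    dist = cdist(std_predictors, model.cluster_centers_, 'euclidean')
    mean_dist.append(dist.min(axis=1).mean())

plt.plot(clusters, mean_dist)
plt.xlabel('Number of clusters k')
plt.ylabel('Average distance to closest centroid')
plt.title('Elbow method for selecting k');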
Lasso regression
Using the Gapminder database, I would like to see which variables most influence the income per person (2010 GDP per capita in constant 2000 US$) from the following list of variables:
Residential electricity consumption (per person in kWh)
CO2 emissions (in metric tons)
Employment rate (total employees aged 15+ in % of the population)
Internet use rate (Internet users per 100 people)
Life expectancy (Life expectancy at birth in years)
Polity score (Democracy score; the scale ranges from -10 (the lowest) to 10 (the highest))
Urban rate (Urban population in %)
Data management
For the question I'm interested in, the countries for which data are missing will be discarded. As missing data in the Gapminder database are already coded as NaN, no special data treatment is needed.
In [1]:
# Magic command to insert the graph directly in the notebook
%matplotlib inline

# Load useful Python libraries for handling data
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Markdown, display
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LassoLarsCV

In [2]:
# Read the data
data_filename = r'gapminder.csv'
data = pd.read_csv(data_filename)
data = data.set_index('country')
General information on the Gapminder data
In [3]:
display(Markdown("Number of countries: {}".format(len(data))))
Number of countries: 213
Predictors selections and standardization
In [4]:
explanatory_vars = ['employrate', 'urbanrate', 'polityscore', 'lifeexpectancy',
                    'internetuserate', 'relectricperperson']
response_var = 'incomeperperson'

constructor_dict = dict()
for var in explanatory_vars + [response_var, ]:
    constructor_dict[var] = pd.to_numeric(data[var], errors='coerce')
numeric_data = pd.DataFrame(constructor_dict, index=data.index).dropna()

display(Markdown("Number of countries after discarding countries with missing data: {}".format(len(numeric_data))))
Number of countries after discarding countries with missing data: 122
The number of countries has unfortunately been almost halved due to missing data. Nevertheless, the k-fold cross-validation technique will be used.
In [5]:
predictors = numeric_data[explanatory_vars]
target = numeric_data[response_var]

# Standardize predictors to have mean=0 and std=1
from sklearn import preprocessing
std_predictors = predictors.copy()
for var in std_predictors.columns:
    std_predictors[var] = preprocessing.scale(std_predictors[var].astype('float64'))

# Check standardization
std_predictors.describe()
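The post ends here. A minimal sketch (my assumption, not the original notebook's code) of how the LASSO fit itself could follow, using the LassoLarsCV import above with 10-fold cross-validation:

# 70/30 train/test split of the standardized predictors built above
pred_train, pred_test, tar_train, tar_test = train_test_split(
    std_predictors, target, test_size=.3, random_state=123)

# LASSO regression with the regularization parameter chosen by 10-fold CV
model = LassoLarsCV(cv=10, precompute=False).fit(pred_train, tar_train)

# Non-zero coefficients indicate the explanatory variables retained by the model
print(pd.Series(model.coef_, index=std_predictors.columns))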
The blog for Random forest
The explanatory variables will be:
Age (Quantitative variable)
Sex (0 = Female, 1 = Male)
Family income (grouped in 5 categories)
Ever smoked 100+ cigarettes (0 = No, 1 = Yes)
Ethnicity (Each of them are recoded 0 = No, 1 = Yes):
White
Black
Native American
Pacific islander
Asian
Hispanic
Raised by who (Each of them are recoded 0 = No, 1 = Yes):
Biological parents (i.e. lives with at least one biological parent)
Adoptive parents
Relatives
Foster parents
In institution
Other
Did biological parents get divorced (0 = No, 1 = Yes)
Highest grade completed: coded from 1 (no schooling) to 14 (completed graduate)
Occupation (Each of them are recoded 0 = No, 1 = Yes):
Full time (35+ hours a week)
Part time (<35 hours a week)
Unemployed and looking for work
Unemployed and not looking for work
Data management
The data will be managed so that cannabis usage is recoded to 0 (never used cannabis) or 1 (used cannabis). Non-answer records (coded 9) will be discarded.
As the response variable has only two categories, no category grouping is needed.
In [1]:
# Magic command to insert the graph directly in the notebook
%matplotlib inline

# Load useful Python libraries for handling data
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Markdown, display, Image

In [3]:
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics
from sklearn import datasets
from sklearn.ensemble import ExtraTreesClassifier

In [4]:
nesarc = pd.read_csv('nesarc_pds.csv')

C:\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:2723: DtypeWarning: Columns (76) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
In [45]:
boolean_flip_ignore = {1: 1, 2: 0, 9: np.nan}
boolean_flip_ignore_str = {'1': 1, '2': 0, '9': np.nan, ' ': 0}
boolean_flip = {1: 1, 2: 0}

# Group the family income into five categories
def income_group(value):
    if value < 5:
        return 0
    elif value < 9:
        return 1
    elif value < 13:
        return 2
    elif value < 17:
        return 3
    else:
        return 4

subnesarc = (nesarc[['AGE', 'SEX', 'S1Q7D', 'S3BQ1A5', 'S1Q11A', 'S1Q11B', 'S3AQ1A',
                     'S1Q1C', 'S1Q1D1', 'S1Q1D2', 'S1Q1D3', 'S1Q1D4', 'S1Q1D5',
                     'S1Q2A', 'S1Q2C1', 'S1Q2C2', 'S1Q2C3', 'S1Q2C4', 'S1Q2C5',
                     'S1Q2D', 'S1Q6A', 'S1Q7A1', 'S1Q7A2', 'S1Q7A6', 'S1Q7A7']]
             .assign(sex=lambda x: pd.to_numeric(x['SEX'].map(boolean_flip)),
                     native=lambda x: pd.to_numeric(x['S1Q1D1'].map(boolean_flip)),
                     asian=lambda x: pd.to_numeric(x['S1Q1D2'].map(boolean_flip)),
                     black=lambda x: pd.to_numeric(x['S1Q1D3'].map(boolean_flip)),
                     pacific=lambda x: pd.to_numeric(x['S1Q1D4'].map(boolean_flip)),
                     white=lambda x: pd.to_numeric(x['S1Q1D5'].map(boolean_flip)),
                     used_canabis=lambda x: (pd.to_numeric(x['S3BQ1A5'], errors='coerce')
                                             .map(boolean_flip_ignore)),
                     family_income=lambda x: pd.to_numeric(x['S1Q11B'].map(income_group)),
                     smoked_100cigarettes=lambda x: (pd.to_numeric(x['S3AQ1A'], errors='coerce')
                                                     .map(boolean_flip_ignore)),
                     bioparent=lambda x: (pd.to_numeric(x['S1Q2A'], errors='coerce')
                                          .map(boolean_flip_ignore)),
                     adoptiveparent=lambda x: pd.to_numeric(x['S1Q2C1'].map(boolean_flip_ignore_str), errors='coerce'),
                     byrelative=lambda x: pd.to_numeric(x['S1Q2C2'].map(boolean_flip_ignore_str), errors='coerce'),
                     byfoster=lambda x: pd.to_numeric(x['S1Q2C3'].map(boolean_flip_ignore_str), errors='coerce'),
                     ininstitution=lambda x: pd.to_numeric(x['S1Q2C4'].map(boolean_flip_ignore_str), errors='coerce'),
                     otherraised=lambda x: pd.to_numeric(x['S1Q2C5'].map(boolean_flip_ignore_str), errors='coerce'),
                     divorced=lambda x: pd.to_numeric(x['S1Q2D'].map(boolean_flip_ignore_str), errors='coerce'),
                     grade=lambda x: pd.to_numeric(x['S1Q6A'], errors='coerce'),
                     fulltime=lambda x: pd.to_numeric(x['S1Q7A1'].map(boolean_flip)),
                     parttime=lambda x: pd.to_numeric(x['S1Q7A2'].map(boolean_flip)),
                     unemployedsearch=lambda x: pd.to_numeric(x['S1Q7A6'].map(boolean_flip)),
                     unemployed=lambda x: pd.to_numeric(x['S1Q7A7'].map(boolean_flip)))
             .dropna())[['sex', 'AGE', 'native', 'asian', 'black', 'pacific', 'white',
                         'used_canabis', 'family_income', 'smoked_100cigarettes',
                         'bioparent', 'adoptiveparent', 'byrelative', 'byfoster',
                         'ininstitution', 'otherraised', 'divorced', 'grade',
                         'fulltime', 'parttime', 'unemployedsearch', 'unemployed']]
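The post stops after the data management. A minimal sketch (my assumption, not the original notebook's code) of how the random forest itself could be fitted on the subnesarc frame built above:

from sklearn.ensemble import RandomForestClassifier  # assumed import for this sketch

predictors = subnesarc.drop('used_canabis', axis=1)
target = subnesarc['used_canabis']
pred_train, pred_test, tar_train, tar_test = train_test_split(
    predictors, target, test_size=.4, random_state=123)

# Fit the random forest and check its accuracy on the test sample
forest = RandomForestClassifier(n_estimators=25).fit(pred_train, tar_train)
print(sklearn.metrics.accuracy_score(tar_test, forest.predict(pred_test)))

# Relative importance of each explanatory variable, via the extra trees
# classifier imported above
importances = ExtraTreesClassifier().fit(pred_train, tar_train).feature_importances_
print(pd.Series(importances, index=predictors.columns).sort_values(ascending=False))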
Decision Tree
The Decision Tree algorithm was applied to the NESARC database to evaluate the influence of the following explanatory variables on the 'ever used cannabis' indicator:
Age (Quantitative variable)
Sex (0 = Female, 1 = Male)
Family income (grouped in 5 categories)
Ever smoked 100+ cigarettes (0 = No, 1 = Yes)
White Ethnicity (2 = No, 1 = Yes)
The decision tree is built using the Gini index as the splitting criterion.
The confusion matrix shows that, on the test sample, the model correctly predicts 12,933 people as never having used cannabis and 627 as having used it. But 2,660 people who actually used cannabis were misclassified, as were 767 people who never used it.
The accuracy score is 80%. But this high result is due to the large proportion of people who never used cannabis. Indeed, the prediction for people who have used cannabis is very poor, with most of them being predicted as never having used it.
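A minimal sketch of the workflow described above (the variable names are illustrative and assume a subnesarc frame prepared as in the random forest post; this is not the original notebook code):

from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeClassifier
import sklearn.metrics

predictors = subnesarc[['AGE', 'sex', 'family_income', 'smoked_100cigarettes', 'white']]
target = subnesarc['used_canabis']
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.4)

# Gini impurity is the splitting criterion discussed above
classifier = DecisionTreeClassifier(criterion='gini').fit(pred_train, tar_train)
predictions = classifier.predict(pred_test)

print(sklearn.metrics.confusion_matrix(tar_test, predictions))
print(sklearn.metrics.accuracy_score(tar_test, predictions))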
The decision tree is huge, with lots of nodes, as expected with the Python approach. I used the option filled=True to fill the nodes with a color indicating the majority class.
The first node splits on the criterion AGE <= 55.5. For people younger than the threshold (13,313 having used cannabis and 4,577 never having used it), the second criterion is whether or not they ever smoked 100+ cigarettes; those having smoked 100+ cigarettes are more likely to have ever used cannabis. The third level for that group splits according to sex, and the tree implies a higher chance for males to have ever used cannabis. Those men are then split according to family income, at the fourth income category: the richer ones are less likely to have ever tried cannabis.
I could go on, but I will stop here as a more detailed description does not really bring much.
The main conclusion from the decision tree is a split of behavior depending on age. The smoker/non-smoker split is the second important criterion, and after that comes sex.