coursera2022
Logistic Regression
Assignment: Test a Logistic Regression Model
Following is the Python program I wrote to fulfill the fourth assignment of the Regression Modeling in Practice online course.
I decided to use Jupyter Notebook as it is a convenient way to write code and present results.
Research question for this assignment
For this assignment, I decided to use the NESARC database with the following question: are people of white ethnicity more likely to have ever used cannabis?
The potential other explanatory variables will be:
Age
Sex
Family income
Data management
The data will be managed so that cannabis usage is recoded to 0 (never used cannabis) or 1 (used cannabis). Non-answer records (coded 9) will be discarded.
As the response variable has only two categories, no category grouping is needed.
The other categorical variable (sex) will be recoded so that 0 means female and 1 means male. The two quantitative explanatory variables (age and family income) will be centered.
In [1]:
# Magic command to insert the graph directly in the notebook
%matplotlib inline

# Load useful Python libraries for handling data
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Markdown, display

In [2]:
nesarc = pd.read_csv('nesarc_pds.csv')

C:\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:2723: DtypeWarning: Columns (76) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

In [3]:
canabis_usage = {1: 1, 2: 0, 9: 9}
sex_shift = {1: 1, 2: 0}
white_race = {1: 1, 2: 0}

subnesarc = (nesarc[['AGE', 'SEX', 'S1Q1D5', 'S1Q7D', 'S3BQ1A5', 'S1Q11A']]
             .assign(sex=lambda x: pd.to_numeric(x['SEX'].map(sex_shift)),
                     white_ethnicity=lambda x: pd.to_numeric(x['S1Q1D5'].map(white_race)),
                     used_canabis=lambda x: (pd.to_numeric(x['S3BQ1A5'], errors='coerce')
                                             .map(canabis_usage)
                                             .replace(9, np.nan)),
                     family_income=lambda x: pd.to_numeric(x['S1Q11A'], errors='coerce'))
             .dropna())
centered_nesarc = subnesarc.assign(age_c=subnesarc['AGE'] - subnesarc['AGE'].mean(),
                                   family_income_c=subnesarc['family_income'] - subnesarc['family_income'].mean())

In [4]:
display(Markdown("Mean age : {:.0f}".format(centered_nesarc['AGE'].mean())))
display(Markdown("Mean family income last year: {:.0f}$".format(centered_nesarc['family_income'].mean())))
Mean age : 46
Mean family income last year: 45631$
Let's check that the quantitative variables are effectively centered.
In [5]:print("Centered age") print(centered_nesarc['age_c'].describe()) print("\nCentered family income") print(centered_nesarc['family_income_c'].describe()) Centered age count 4.272500e+04 mean -2.667486e-13 std 1.819181e+01 min -2.841439e+01 25% -1.441439e+01 50% -2.414394e+00 75% 1.258561e+01 max 5.158561e+01 Name: age_c, dtype: float64 Centered family income count 4.272500e+04 mean -5.710829e-10 std 5.777221e+04 min -4.560694e+04 25% -2.863094e+04 50% -1.263094e+04 75% 1.436906e+04 max 2.954369e+06 Name: family_income_c, dtype: float64
The means are both very close to 0, confirming the centering.
Distributions visualization
The following plots show the relationship between the response variable and each of the explanatory variables.
In [6]:
g = sns.factorplot(x='white_ethnicity', y='used_canabis', data=centered_nesarc,
                   kind="bar", ci=None)
g.set_xticklabels(['Non White', 'White'])
plt.xlabel('White ethnicity')
plt.ylabel('Ever used cannabis')
plt.title('Ever used cannabis dependence on the white ethnicity');

In [7]:
g = sns.factorplot(x='sex', y='used_canabis', data=centered_nesarc, kind="bar", ci=None)
g.set_xticklabels(['Female', 'Male'])
plt.ylabel('Ever used cannabis')
plt.title('Ever used cannabis dependence on the sex');

In [8]:
g = sns.boxplot(x='used_canabis', y='family_income', data=centered_nesarc)
g.set_yscale('log')
g.set_xticklabels(('No', 'Yes'))
plt.xlabel('Ever used cannabis')
plt.ylabel('Family income ($)');

In [9]:
g = sns.boxplot(x='used_canabis', y='AGE', data=centered_nesarc)
g.set_xticklabels(('No', 'Yes'))
plt.xlabel('Ever used cannabis')
plt.ylabel('Age');
The four plots above show the following trends:
White people try cannabis more than non-white people
Males try cannabis more than females
Younger people try cannabis more than older ones
People from richer families try cannabis more than those from poorer families
Logistic regression model
The plots showed the direction of a potential relationship. But a rigorous statistical test has to be carried out to confirm the four previous hypotheses.
The following code will test a logistic regression model on our hypothesis.
In [10]:
model = smf.logit(formula='used_canabis ~ family_income_c + age_c + sex + white_ethnicity',
                  data=centered_nesarc).fit()
model.summary()

Optimization terminated successfully.
         Current function value: 0.451313
         Iterations 6

Out[10]:
Logit Regression Results
Dep. Variable:  used_canabis      No. Observations:  42725
Model:          Logit             Df Residuals:      42720
Method:         MLE               Df Model:          4
Date:           Sun, 24 Jul 2016  Pseudo R-squ.:     0.07529
Time:           16:15:55          Log-Likelihood:    -19282.
converged:      True              LL-Null:           -20852.
                                  LLR p-value:       0.000

                      coef   std err        z   P>|z|  [95.0% Conf. Int.]
Intercept          -2.1043     0.032  -66.764   0.000    -2.166    -2.043
family_income_c  2.353e-06  2.16e-07   10.880   0.000  1.93e-06  2.78e-06
age_c              -0.0378     0.001  -45.288   0.000    -0.039    -0.036
sex                 0.5060     0.026   19.766   0.000     0.456     0.556
white_ethnicity     0.3583     0.032   11.268   0.000     0.296     0.421
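As a side note (a minimal sketch of my own, not part of the original notebook), the intercept can be converted into a baseline probability: with every predictor at 0, i.e. a non-white woman of mean age and mean family income, the log-odds reduce to the intercept reported above.

import numpy as np

intercept = -2.1043                        # from the Logit summary table above
p_baseline = 1 / (1 + np.exp(-intercept))  # inverse logit
print("Baseline probability of ever having used cannabis: {:.3f}".format(p_baseline))
# prints 0.109, i.e. roughly an 11% chance for the reference individual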
In [11]:
params = model.params
conf = model.conf_int()
conf['Odds Ratios'] = params
conf.columns = ['Lower Conf. Int.', 'Upper Conf. Int.', 'Odds Ratios']
np.exp(conf)

Out[11]:
                 Lower Conf. Int.  Upper Conf. Int.  Odds Ratios
Intercept                0.114625          0.129699     0.121930
family_income_c          1.000002          1.000003     1.000002
age_c                    0.961303          0.964455     0.962878
sex                      1.577421          1.743910     1.658578
white_ethnicity          1.344412          1.522873     1.430863
Confounders analysis
As all four variable coefficients have significant p-values (<< 0.05), none of the tested variables acts as a confounder in this model.
But as the pseudo R-squared is quite low, the model does not explain the response variable well. There may therefore be a confounding variable that I have not tested for.
Summary
From the odds ratios results, we can conclude that:
People with white ethnicity are more likely to have ever used cannabis (OR=1.43, 95% confidence int. [1.34, 1.52], p<.0005)
So the results support the hypothesized relationship between our primary explanatory variable (white ethnicity) and the response variable (ever used cannabis)
Males are more likely to have ever used cannabis than females (OR=1.66, 95% CI=[1.58, 1.74], p<.0005)
Younger people are more likely to have ever used cannabis: each additional year of age multiplies the odds by 0.963 (OR=0.963, 95% CI=[0.961, 0.964], p<.0005)
Regarding the last explanatory variable (family income), I am not sure I can really conclude. Strictly from the results, people coming from richer families are more likely to have ever used cannabis (OR=1.000002, 95% CI=[1.000002, 1.000003], p<.0005). But the odds ratio is so close to 1.0 mainly because it is expressed per single dollar of income, which makes the effect size hard to judge.
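To make that last point concrete, here is a minimal sketch (not part of the original notebook) that rescales the per-dollar coefficient to a $10,000 step, which is much easier to interpret:

import numpy as np

coef_per_dollar = 2.353e-06   # family_income_c coefficient from the summary above
or_per_10000 = np.exp(coef_per_dollar * 10000)
print("Odds ratio per additional $10,000: {:.3f}".format(or_per_10000))
# prints 1.024: about a 2.4% increase in the odds per extra $10,000 of family income

Seen this way, the income effect is statistically significant but modest in size.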
Multiple Regression
Assignment: Test a Multiple Regression Model
Following is the Python program I wrote to fulfill the third assignment of the Regression Modeling in Practice online course.
I decided to use Jupyter Notebook as it is a convenient way to write code and present results.
Assignment research question
Using the Gapminder database, I would like to see if there is a relationship between the income per person (explanatory variable) and the residential consumption of electricity (response variable).
The following variables will be tested also to improve the prediction model and figure out potential confounders:
Employment rate (total employees aged 15+)
Oil consumption per person (tonnes per year per person)
Urban rate (Urban population in %)
Data management
For the question I'm interested in, the countries for which data are missing will be discarded. As missing data in the Gapminder database are already coded as NaN, no special data treatment is needed.
In [2]:
# Magic command to insert the graph directly in the notebook
%matplotlib inline

# Load useful Python libraries for handling data
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Markdown, display

In [3]:
# Read the data
data_filename = r'gapminder.csv'
data = pd.read_csv(data_filename, low_memory=False)
data = data.set_index('country')
General information on the Gapminder data
In [4]:
display(Markdown("Number of countries: {}".format(len(data))))
Number of countries: 213
Data management
In order to carry out the regression analysis, we need to center the potential explanatory variables.
In [5]:
subdata = (data[['incomeperperson', 'relectricperperson', 'employrate', 'oilperperson', 'urbanrate']]
           .assign(income=lambda x: pd.to_numeric(x['incomeperperson'], errors='coerce'),
                   electricity=lambda x: pd.to_numeric(x['relectricperperson'], errors='coerce'),
                   employ=lambda x: pd.to_numeric(x['employrate'], errors='coerce'),
                   oil=lambda x: pd.to_numeric(x['oilperperson'], errors='coerce'),
                   urban=lambda x: pd.to_numeric(x['urbanrate'], errors='coerce'))
           .dropna())
display(Markdown("Number of countries after discarding countries with missing data: {}".format(len(subdata))))

centered_data = (subdata.assign(income_c=lambda x: x['income'] - subdata['income'].mean(),
                                employ_c=lambda x: x['employ'] - subdata['employ'].mean(),
                                oil_c=lambda x: x['oil'] - subdata['oil'].mean(),
                                urban_c=lambda x: x['urban'] - subdata['urban'].mean()))
Number of countries after discarding countries with missing data: 61
Let's check that the variables are properly centered.
In [6]:
display(Markdown("Mean of income per person after centering: {:3g}".format(centered_data['income_c'].mean())))
display(Markdown("Mean of employment rate after centering: {:3g}".format(centered_data['employ_c'].mean())))
display(Markdown("Mean of oil consumption per person after centering: {:3g}".format(centered_data['oil_c'].mean())))
display(Markdown("Mean of urban rate after centering: {:3g}".format(centered_data['urban_c'].mean())))
Mean of income per person after centering: -1.77426e-12
Mean of employment rate after centering: -2.56261e-15
Mean of oil consumption per person after centering: 5.49651e-16
Mean of urban rate after centering: 1.9569e-14
Bivariate distribution
Before looking at the multiple regression analysis, a polynomial regression will be applied to the data to see if it fits the results better than a straight line.
In [7]:
sns.regplot(x='income_c', y='electricity', data=centered_data)
sns.regplot(x='income_c', y='electricity', order=2, data=centered_data)
plt.xlabel('Centered income per person (2000 US$)')
plt.ylabel('Residential electricity consumption (kWh)')
plt.title('Scatterplot for the association between the income and the residential electricity consumption');
OLS regression model
Test the linear regression model.
In [8]:
reg1 = smf.ols('electricity ~ income_c', data=centered_data).fit()
reg1.summary()

Out[8]:
OLS Regression Results
Dep. Variable:    electricity       R-squared:           0.402
Model:            OLS               Adj. R-squared:      0.392
Method:           Least Squares     F-statistic:         39.69
Date:             Sun, 17 Jul 2016  Prob (F-statistic):  4.08e-08
Time:             19:02:59          Log-Likelihood:      -530.88
No. Observations: 61                AIC:                 1066.
Df Residuals:     59                BIC:                 1070.
Df Model:         1
Covariance Type:  nonrobust

                coef  std err      t  P>|t|  [95.0% Conf. Int.]
Intercept  1626.2603  189.670  8.574  0.000  1246.731  2005.790
income_c      0.0950    0.015  6.300  0.000     0.065     0.125

Omnibus:        80.246  Durbin-Watson:    2.135
Prob(Omnibus):   0.000  Jarque-Bera (JB): 1091.497
Skew:            3.660  Prob(JB):         9.65e-238
Kurtosis:       22.387  Cond. No.         1.26e+04
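To read the fitted line concretely, here is a minimal sketch (my own illustration, using the coefficients from the summary above):

intercept, slope = 1626.2603, 0.0950  # coefficients from the OLS summary above
income_c = 10000                      # a country 10,000 US$ above the mean income
predicted_kwh = intercept + slope * income_c
print("Predicted residential consumption: {:.0f} kWh".format(predicted_kwh))
# prints 2576 kWh, versus about 1626 kWh for a country at the mean income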
Test a second order polynomial regression model.
In [9]:
reg2 = smf.ols('electricity ~ income_c + I(income_c**2)', data=centered_data).fit()
reg2.summary()

Out[9]:
OLS Regression Results
Dep. Variable:    electricity       R-squared:           0.413
Model:            OLS               Adj. R-squared:      0.393
Method:           Least Squares     F-statistic:         20.43
Date:             Sun, 17 Jul 2016  Prob (F-statistic):  1.92e-07
Time:             19:02:59          Log-Likelihood:      -530.31
No. Observations: 61                AIC:                 1067.
Df Residuals:     58                BIC:                 1073.
Df Model:         2
Covariance Type:  nonrobust

                        coef   std err       t  P>|t|  [95.0% Conf. Int.]
Intercept          1904.3071   325.761   5.846  0.000  1252.225  2556.389
income_c              0.1117     0.022   5.095  0.000     0.068     0.156
I(income_c ** 2)  -1.758e-06  1.68e-06  -1.049  0.298  -5.11e-06  1.6e-06

Omnibus:        76.908  Durbin-Watson:    2.098
Prob(Omnibus):   0.000  Jarque-Bera (JB): 912.307
Skew:            3.504  Prob(JB):         7.85e-199
Kurtosis:       20.602  Cond. No.         3.92e+08
From the second OLS analysis, we can see that the coefficient corresponding to the second order term has a p-value of 0.3 > 0.05. Therefore we should keep only the linear term for our primary relation.
Multiple regression
The multiple regression will now be carried out.
In [10]:
reg3 = smf.ols('electricity ~ income_c + oil_c + employ_c + urban_c', data=centered_data).fit()
reg3.summary()

Out[10]:
OLS Regression Results
Dep. Variable:    electricity       R-squared:           0.452
Model:            OLS               Adj. R-squared:      0.412
Method:           Least Squares     F-statistic:         11.53
Date:             Sun, 17 Jul 2016  Prob (F-statistic):  6.76e-07
Time:             19:03:00          Log-Likelihood:      -528.25
No. Observations: 61                AIC:                 1067.
Df Residuals:     56                BIC:                 1077.
Df Model:         4
Covariance Type:  nonrobust

                coef  std err      t  P>|t|  [95.0% Conf. Int.]
Intercept  1626.2603  186.470  8.721  0.000  1252.716  1999.804
income_c      0.0758    0.020  3.883  0.000     0.037     0.115
oil_c       123.7307  135.564  0.913  0.365  -147.837   395.298
employ_c     48.6765   25.901  1.879  0.065    -3.209   100.562
urban_c       2.3571   13.864  0.170  0.866   -25.415    30.129

Omnibus:        62.530  Durbin-Watson:    2.146
Prob(Omnibus):   0.000  Jarque-Bera (JB): 537.653
Skew:            2.704  Prob(JB):         1.78e-117
Kurtosis:       16.502  Cond. No.         1.26e+04
Unexpectedly, all explanatory variables except the primary one (income per person) should be discarded, as their coefficients have p-values higher than 0.05.
Finally, we will look at the diagnostic graphs.
Q-Q plot for normality
In [11]:
fig = sm.qqplot(reg1.resid, line='r')
The residuals do not follow the line well, especially in the upper half of the distribution. As our model is a simple linear regression between residential electricity consumption and income per person, this means the model is a poor predictor for countries with larger incomes.
Residuals plot
In [12]:
stdres = pd.DataFrame(reg1.resid_pearson)
plt.plot(stdres, 'o')
plt.axhline(y=0, color='r')
plt.ylabel('Standardized Residual')
plt.xlabel('Observation Number');
The previous plot highlights only one clear extreme outlier. So this confirms that the model is reasonable, although it could be improved.
Partial regression plots
In [13]:
fig = plt.figure(figsize=(12, 8))
sm.graphics.plot_regress_exog(reg3, 'urban_c', fig);

In [19]:
fig = plt.figure(figsize=(12, 8))
sm.graphics.plot_regress_exog(reg3, 'oil_c', fig);

In [20]:
fig = plt.figure(figsize=(12, 8))
sm.graphics.plot_regress_exog(reg3, 'employ_c', fig);
The partial regression plots above are shown for the sake of the assignment, as all variables but the income per person are non-significant in the multiple regression model. They all show that some extreme outliers are present.
The partial plot against the urban rate has a horizontal partial regression line. This confirms that urban rate cannot improve the model.
Leverage plot
In [14]:
sm.graphics.influence_plot(reg1);
The leverage plot above shows that our extreme outlier, the United Arab Emirates, does not have a strong influence on the estimation of the model coefficients. On the contrary, Norway, which has the second highest residual, has an important influence on the model estimation.
Analysis of trouble case
To conclude this assignment, I would like to look at the same question but taking the oil consumption as the primary explanatory variable.
Let's first see if a second-order polynomial fits better than the linear regression model.
In [15]:
sns.regplot(x='oil_c', y='electricity', data=centered_data);
sns.regplot(x='oil_c', y='electricity', order=2, data=centered_data)
plt.xlabel('Centered oil consumption per person (tonnes)')
plt.ylabel('Residential electricity consumption (kWh)')
plt.title('Scatterplot for the association between the oil and the residential electricity consumption');

In [16]:
reg_oil1 = smf.ols('electricity ~ oil_c', data=centered_data).fit()
reg_oil1.summary()
Out[16]:
OLS Regression Results
Dep. Variable:    electricity       R-squared:           0.199
Model:            OLS               Adj. R-squared:      0.185
Method:           Least Squares     F-statistic:         14.64
Date:             Sun, 17 Jul 2016  Prob (F-statistic):  0.000317
Time:             19:03:39          Log-Likelihood:      -539.82
No. Observations: 61                AIC:                 1084.
Df Residuals:     59                BIC:                 1088.
Df Model:         1
Covariance Type:  nonrobust

                coef  std err      t  P>|t|  [95.0% Conf. Int.]
Intercept  1626.2603  219.583  7.406  0.000  1186.877  2065.644
oil_c       487.8025  127.505  3.826  0.000   232.666   742.939

Omnibus:        47.149  Durbin-Watson:    1.867
Prob(Omnibus):   0.000  Jarque-Bera (JB): 273.869
Skew:            1.975  Prob(JB):         3.39e-60
Kurtosis:       12.599  Cond. No.         1.72
In [17]:
reg_oil2 = smf.ols('electricity ~ oil_c + I(oil_c**2)', data=centered_data).fit()
reg_oil2.summary()
Out[17]:
OLS Regression Results
Dep. Variable:    electricity       R-squared:           0.595
Model:            OLS               Adj. R-squared:      0.581
Method:           Least Squares     F-statistic:         42.58
Date:             Sun, 17 Jul 2016  Prob (F-statistic):  4.18e-12
Time:             19:03:39          Log-Likelihood:      -519.02
No. Observations: 61                AIC:                 1044.
Df Residuals:     58                BIC:                 1050.
Df Model:         2
Covariance Type:  nonrobust

                     coef  std err       t  P>|t|  [95.0% Conf. Int.]
Intercept       2079.9567  168.620  12.335  0.000  1742.428  2417.486
oil_c           1617.3225  175.683   9.206  0.000  1265.655  1968.990
I(oil_c ** 2)   -152.9758   20.316  -7.530  0.000  -193.643  -112.309

Omnibus:        46.176  Durbin-Watson:    1.686
Prob(Omnibus):   0.000  Jarque-Bera (JB): 230.695
Skew:            2.004  Prob(JB):         8.04e-51
Kurtosis:       11.643  Cond. No.         19.1
From the OLS analysis, the second-order regression fits the results better. But the outlier far on the right seems to deteriorate the accuracy of the linear regression coefficient.
This is confirmed by the leverage plot below. Singapore has a strong influence on the regression model. This country is a singularity, having a huge oil consumption but a reasonable residential electricity consumption.
In [18]:sm.graphics.influence_plot(reg_oil1);
Anyway, the multiple regression shows that oil consumption is not a significant explanatory variable of the residential electricity consumption. Indeed, in this case income per person is a confounder; it is the real explanatory variable.
Conclusion
In this assignment we have seen that only our primary explanatory variable (income per person) is a good one to build a regression model on. However, with an R-squared value of 0.4, 60% of the variation in electricity consumption is still not explained by the model; in particular, the model performs poorly for countries with higher incomes.
In the latter part, we used the tools described this week to examine the trouble that can be caused by outlier data.
Linear
Assignment: Test a Basic Linear Regression Model
Following is the Python program I wrote to fulfill the second assignment of the Regression Modeling in Practice online course.
I decided to use Jupyter Notebook as it is a convenient way to write code and present results.
Assignment research question
Using the Gapminder database, I would like to see if there is a linear relationship between the income per person (explanatory variable) and the residential consumption of electricity (response variable).
Data management
For the question I'm interested in, the countries for which data are missing will be discarded. As missing data in the Gapminder database are already coded as NaN, no special data treatment is needed.
In [1]:
# Magic command to insert the graph directly in the notebook
%matplotlib inline

# Load useful Python libraries for handling data
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Markdown, display

In [2]:
# Read the data
data_filename = r'gapminder.csv'
data = pd.read_csv(data_filename, low_memory=False)
data = data.set_index('country')
General information on the Gapminder data
In [3]:
display(Markdown("Number of countries: {}".format(len(data))))
display(Markdown("Number of variables: {}".format(len(data.columns))))
Number of countries: 213
Number of variables: 15
Variables distribution
Before computing the linear regression between the income per person and the residential electricity consumption, let's have a look at the distributions of the variables.
In [4]:
subdata2 = (data[['incomeperperson', 'relectricperperson']]
            .assign(income=lambda x: pd.to_numeric(data['incomeperperson'], errors='coerce'),
                    electricity=lambda x: pd.to_numeric(data['relectricperperson'], errors='coerce'))
            .dropna())
Income per person
In [5]:
sns.distplot(subdata2['income'], kde=False)
plt.xlabel('Income per person (2000 US$)');
From the distribution graph, we can see that the distribution is right-skewed and bimodal. There is also a singular case with a high income per capita. That country is:
In [13]:
subdata2.loc[subdata2['income'] > 45000]

Out[13]:
             incomeperperson relectricperperson  electricity        income
country
Luxembourg  52301.5871789984   1566.10613860734  1566.106139  52301.587179

In [15]:
subdata2['income'].describe()

Out[15]:
count      130.000000
mean      8784.531620
std      11420.725637
min        103.775857
25%       1156.551493
50%       3172.679153
75%      11606.578377
max      52301.587179
Name: income, dtype: float64
For this assignment the explanatory variable income per person will have to be centered as its mean is 8784.5.
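A minimal sketch of that centering step (the column name income_c is my choice for illustration; it is not created in this notebook):

# Center the income so that its mean becomes 0; the slope of the later
# regression is unchanged, only the intercept becomes the mean response.
subdata2 = subdata2.assign(income_c=subdata2['income'] - subdata2['income'].mean())
print(subdata2['income_c'].mean())   # ~0, up to floating-point error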
Residential electricity consumption
In [6]:
sns.distplot(subdata2['electricity'], kde=False)
plt.xlabel('Residential electricity consumption (kWh)');

In [16]:
subdata2['electricity'].describe()

Out[16]:
count      130.000000
mean      1148.615676
std       1584.959430
min          0.000000
25%        226.318460
50%        609.335172
75%       1484.702878
max      11154.755033
Name: electricity, dtype: float64
The residential electricity consumption is also right-skewed, and a couple of countries present unusually high values.
Data
Research question
Using the Gapminder database, I would like to see if increasing Internet usage results in an increasing suicide rate. A study shows that other factors, like unemployment, could have a great impact.
So for this assignment, the three following variables will be analyzed:
Internet Usage Rate (per 100 people)
Suicide Rate (per 100 000 people)
Employment Rate (% of the population of age 15+)
As the Gapminder database is an aggregation of data from different sources, I will describe the data separately for the different variables.
About my data
Sample
The sample comes from the Gapminder database (http://www.gapminder.org/). This is a non-profit organization founded in Stockholm to promote sustainable global development in order to achieve the United Nations Millennium Development Goals.
The data are gathered for 215 areas (192 UN members, with Serbia and Montenegro aggregated, plus 24 other areas). But not all indicators are available for all countries.
Although the database provides time series for all indicators on a yearly basis, the sample used for this class uses data from a certain year depending on the indicator:
Internet Usage Rate (per 100 people): data used are for 2010
Suicide per 100 000 people: data used are for 2008
Employment rate for people of age 15+: data used are for 2007
Procedure
The three indicators are collected by different organizations:
Internet Usage Rate (per 100 people):
Data from the World Bank (http://databank.worldbank.org/data/home.aspx)
The data are computed as the weighted average of different sources: International Telecommunication Union, World Telecommunication/ICT Development Report and database, and World Bank estimates.
Suicide per 100 000 people: data used are for 2008
Data from the World Health Organization (WHO) (http://www.who.int/violence_injury_prevention/surveillance/databases/mortality/en/)
The database is built on annual reports by the 120 Member States from their civil registration systems.
Employment rate for people of age 15+: data used are for 2007
Data from the International Labour Organization (ILO) (http://www.ilo.org/emppolicy/lang--en/index.htm)
Every two years since 1999, the ILO has published 18 key indicators of the labour market, including the employment rate.
The data were collected by different methods depending on the country. The precise list is available at http://www.ilo.org/ilostat. Those methods are Population census, Official estimate, Administrative records, Population register, Household surveys, Labour force survey, Household income/expenditure survey or Other household survey.
Measures
All three indicators are constructed mainly from reports by the member states.
The explanatory variable Internet Usage Rate (per 100 people):
Definition: Internet users are defined as individuals who used the Internet in the last 12 months.
Scale: 0 to 100
Management: suppression of countries with no data provided
The response variable Suicide, age adjusted, per 100 000 people:
Definition: Mortality due to self-inflicted injury, per 100 000 standard population, age adjusted. Combination of data from WHO Violence and Injury Prevention (VIP) and from the WHO Global Burden of Disease.
Scale: 0 to 100 000
Management: suppression of countries with no data provided
Another explanatory variable Employment rate for people of age 15+:
Definition: Percentage of the total population, age above 15, that has been employed during the given year.
Scale: 0% to 100%
Management: suppression of countries with no data provided
K Means
I decided to use the same research question as for the previous assignment on Lasso regression analysis.
Using the Gapminder database, I would like to see which variables most influence income per person (2010 GDP per capita in constant 2000 US$). It will therefore be my test variable.
The cluster analysis will be carried out on the following variables:
Residential electricity consumption (per person in kWh)
CO2 emissions (in metric tons)
Employment rate (total employees age 15+ in % of the population)
Internet use rate (Internet users per 100 people)
Life expectancy (Life expectancy at birth in years)
Polity score (Democracy score; the scale ranges from -10 (the lowest) to 10 (the highest))
Urban rate (Urban population in %)
Data management
The countries for which data are missing will be discarded. As missing data in the Gapminder database are already coded as NaN, no special data treatment is needed.
In [1]:
# Magic command to insert the graph directly in the notebook
%matplotlib inline

# Load useful Python libraries for handling data
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Markdown, display
from sklearn.cross_validation import train_test_split
from sklearn import preprocessing
from sklearn.cluster import KMeans

In [2]:
# Read the data
data_filename = r'gapminder.csv'
data = pd.read_csv(data_filename)
data = data.set_index('country')
General information on the Gapminder data
In [3]:
display(Markdown("Number of countries: {}".format(len(data))))
Number of countries: 213
Predictors selections and standardization
In [4]:
explanatory_vars = ['employrate', 'urbanrate', 'polityscore', 'lifeexpectancy',
                    'internetuserate', 'relectricperperson']
test_var = 'incomeperperson'

constructor_dict = dict()
for var in explanatory_vars + [test_var, ]:
    constructor_dict[var] = pd.to_numeric(data[var], errors='coerce')
numeric_data = pd.DataFrame(constructor_dict, index=data.index).dropna()

display(Markdown("Number of countries after discarding countries with missing data: {}".format(len(numeric_data))))
Number of countries after discarding countries with missing data: 122
In [5]:
predictors = numeric_data[explanatory_vars]
target = numeric_data[test_var]

# Standardize predictors to have mean=0 and std=1
std_predictors = predictors.copy()
for var in std_predictors.columns:
    std_predictors[var] = preprocessing.scale(std_predictors[var].astype('float64'))

# Check standardization
std_predictors.describe()
Out[5]:
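The post stops at the standardization step. For completeness, here is a minimal sketch (my assumption, not the original notebook's code) of how the cluster analysis itself could proceed, scanning k with the elbow method on the standardized predictors built above:

from scipy.spatial.distance import cdist

# Run k-means for k = 1..9 and record the average distance of each country
# to its assigned cluster centroid
clusters = range(1, 10)
mean_dist = []
for k in clusters:
    model = KMeans(n_clusters=k).fit(std_predictors)
    dist = cdist(std_predictors, model.cluster_centers_, 'euclidean')
    mean_dist.append(dist.min(axis=1).mean())

plt.plot(clusters, mean_dist)
plt.xlabel('Number of clusters k')
plt.ylabel('Average distance to closest centroid')
plt.title('Elbow method for selecting k');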
Lasso regression
Using the Gapminder database, I would like to see which variables most influence the income per person (2010 GDP per capita in constant 2000 US$) from the following list of variables:
Residential electricity consumption (per person in kWh)
CO2 emissions (in metric tons)
Employment rate (total employees aged 15+ in % of the population)
Internet use rate (Internet users per 100 people)
Life expectancy (Life expectancy at birth in years)
Polity score (Democracy score; the scale ranges from -10 (the lowest) to 10 (the highest))
Urban rate (Urban population in %)
Data management
For the question I'm interested in, the countries for which data are missing will be discarded. As missing data in the Gapminder database are already coded as NaN, no special data treatment is needed.
In [1]:
# Magic command to insert the graph directly in the notebook
%matplotlib inline

# Load useful Python libraries for handling data
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Markdown, display
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LassoLarsCV

In [2]:
# Read the data
data_filename = r'gapminder.csv'
data = pd.read_csv(data_filename)
data = data.set_index('country')
General information on the Gapminder data
In [3]:
display(Markdown("Number of countries: {}".format(len(data))))
Number of countries: 213
Predictors selections and standardization
In [4]:
explanatory_vars = ['employrate', 'urbanrate', 'polityscore', 'lifeexpectancy',
                    'internetuserate', 'relectricperperson']
response_var = 'incomeperperson'

constructor_dict = dict()
for var in explanatory_vars + [response_var, ]:
    constructor_dict[var] = pd.to_numeric(data[var], errors='coerce')
numeric_data = pd.DataFrame(constructor_dict, index=data.index).dropna()

display(Markdown("Number of countries after discarding countries with missing data: {}".format(len(numeric_data))))
Number of countries after discarding countries with missing data: 122
The number of countries has unfortunately been almost halved due to missing data. Nevertheless, the k-fold cross-validation technique will be used.
In [5]:
predictors = numeric_data[explanatory_vars]
target = numeric_data[response_var]

# Standardize predictors to have mean=0 and std=1
from sklearn import preprocessing
std_predictors = predictors.copy()
for var in std_predictors.columns:
    std_predictors[var] = preprocessing.scale(std_predictors[var].astype('float64'))

# Check standardization
std_predictors.describe()
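The post ends here. A minimal sketch (my assumption, not the original notebook's code) of how the LASSO fit itself could follow, using the LassoLarsCV import above with 10-fold cross-validation:

# 70/30 train/test split of the standardized predictors built above
pred_train, pred_test, tar_train, tar_test = train_test_split(
    std_predictors, target, test_size=.3, random_state=123)

# LASSO regression with the regularization parameter chosen by 10-fold CV
model = LassoLarsCV(cv=10, precompute=False).fit(pred_train, tar_train)

# Non-zero coefficients indicate the explanatory variables retained by the model
print(pd.Series(model.coef_, index=std_predictors.columns))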
The blog for Random forest
The explanatory variables will be:
Age (Quantitative variable)
Sex (0 = Female, 1 = Male)
Family income (grouped in 5 categories)
Ever smoked 100+ cigarettes (0 = No, 1 = Yes)
Ethnicity (Each of them are recoded 0 = No, 1 = Yes):
White
Black
Native American
Pacific islander
Asian
Hispanic
Raised by who (Each of them are recoded 0 = No, 1 = Yes):
Biological parents (i.e. lives with at least one biological parent)
Adoptive parents
Relatives
Foster parents
In institution
Other
Did biological parents get divorced (0 = No, 1 = Yes)
Highest grade completed: coded from 1 (no schooling) to 14 (completed graduate)
Occupation (Each of them are recoded 0 = No, 1 = Yes):
Full time (35+ hours a week)
Part time (<35 hours a week)
Unemployed and looking for work
Unemployed and not looking for work
Data management
The data will be managed so that cannabis usage is recoded to 0 (never used cannabis) or 1 (used cannabis). Non-answer records (coded 9) will be discarded.
As the response variable has only two categories, no category grouping is needed.
In [1]:
# Magic command to insert the graph directly in the notebook
%matplotlib inline

# Load useful Python libraries for handling data
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Markdown, display, Image

In [3]:
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics
from sklearn import datasets
from sklearn.ensemble import ExtraTreesClassifier

In [4]:
nesarc = pd.read_csv('nesarc_pds.csv')

C:\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:2723: DtypeWarning: Columns (76) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
In [45]:
boolean_flip_ignore = {1: 1, 2: 0, 9: np.nan}
boolean_flip_ignore_str = {'1': 1, '2': 0, '9': np.nan, ' ': 0}
boolean_flip = {1: 1, 2: 0}

# Group the family income into five categories
def income_group(value):
    if value < 5:
        return 0
    elif value < 9:
        return 1
    elif value < 13:
        return 2
    elif value < 17:
        return 3
    else:
        return 4

subnesarc = (nesarc[['AGE', 'SEX', 'S1Q7D', 'S3BQ1A5', 'S1Q11A', 'S1Q11B', 'S3AQ1A',
                     'S1Q1C', 'S1Q1D1', 'S1Q1D2', 'S1Q1D3', 'S1Q1D4', 'S1Q1D5',
                     'S1Q2A', 'S1Q2C1', 'S1Q2C2', 'S1Q2C3', 'S1Q2C4', 'S1Q2C5',
                     'S1Q2D', 'S1Q6A', 'S1Q7A1', 'S1Q7A2', 'S1Q7A6', 'S1Q7A7']]
             .assign(sex=lambda x: pd.to_numeric(x['SEX'].map(boolean_flip)),
                     native=lambda x: pd.to_numeric(x['S1Q1D1'].map(boolean_flip)),
                     asian=lambda x: pd.to_numeric(x['S1Q1D2'].map(boolean_flip)),
                     black=lambda x: pd.to_numeric(x['S1Q1D3'].map(boolean_flip)),
                     pacific=lambda x: pd.to_numeric(x['S1Q1D4'].map(boolean_flip)),
                     white=lambda x: pd.to_numeric(x['S1Q1D5'].map(boolean_flip)),
                     used_canabis=lambda x: (pd.to_numeric(x['S3BQ1A5'], errors='coerce')
                                             .map(boolean_flip_ignore)),
                     family_income=lambda x: pd.to_numeric(x['S1Q11B'].map(income_group)),
                     smoked_100cigarettes=lambda x: (pd.to_numeric(x['S3AQ1A'], errors='coerce')
                                                     .map(boolean_flip_ignore)),
                     bioparent=lambda x: (pd.to_numeric(x['S1Q2A'], errors='coerce')
                                          .map(boolean_flip_ignore)),
                     adoptiveparent=lambda x: pd.to_numeric(x['S1Q2C1'].map(boolean_flip_ignore_str), errors='coerce'),
                     byrelative=lambda x: pd.to_numeric(x['S1Q2C2'].map(boolean_flip_ignore_str), errors='coerce'),
                     byfoster=lambda x: pd.to_numeric(x['S1Q2C3'].map(boolean_flip_ignore_str), errors='coerce'),
                     ininstitution=lambda x: pd.to_numeric(x['S1Q2C4'].map(boolean_flip_ignore_str), errors='coerce'),
                     otherraised=lambda x: pd.to_numeric(x['S1Q2C5'].map(boolean_flip_ignore_str), errors='coerce'),
                     divorced=lambda x: pd.to_numeric(x['S1Q2D'].map(boolean_flip_ignore_str), errors='coerce'),
                     grade=lambda x: pd.to_numeric(x['S1Q6A'], errors='coerce'),
                     fulltime=lambda x: pd.to_numeric(x['S1Q7A1'].map(boolean_flip)),
                     parttime=lambda x: pd.to_numeric(x['S1Q7A2'].map(boolean_flip)),
                     unemployedsearch=lambda x: pd.to_numeric(x['S1Q7A6'].map(boolean_flip)),
                     unemployed=lambda x: pd.to_numeric(x['S1Q7A7'].map(boolean_flip)))
             .dropna())[['sex', 'AGE', 'native', 'asian', 'black', 'pacific', 'white',
                         'used_canabis', 'family_income', 'smoked_100cigarettes',
                         'bioparent', 'adoptiveparent', 'byrelative', 'byfoster',
                         'ininstitution', 'otherraised', 'divorced', 'grade',
                         'fulltime', 'parttime', 'unemployedsearch', 'unemployed']]
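The post stops after the data management. A minimal sketch (my assumption, not the original notebook's code) of how the random forest itself could be fitted on the subnesarc frame built above:

from sklearn.ensemble import RandomForestClassifier  # assumed import for this sketch

predictors = subnesarc.drop('used_canabis', axis=1)
target = subnesarc['used_canabis']
pred_train, pred_test, tar_train, tar_test = train_test_split(
    predictors, target, test_size=.4, random_state=123)

# Fit the random forest and check its accuracy on the test sample
forest = RandomForestClassifier(n_estimators=25).fit(pred_train, tar_train)
print(sklearn.metrics.accuracy_score(tar_test, forest.predict(pred_test)))

# Relative importance of each explanatory variable, via the extra trees
# classifier imported above
importances = ExtraTreesClassifier().fit(pred_train, tar_train).feature_importances_
print(pd.Series(importances, index=predictors.columns).sort_values(ascending=False))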
Decision Tree
The Decision Tree algorithm was applied to the NESARC database to evaluate the influence of the following explanatory variables on the 'ever used cannabis' indicator:
Age (Quantitative variable)
Sex (0 = Female, 1 = Male)
Family income (grouped in 5 categories)
Ever smoked 100+ cigarettes (0 = No, 1 = Yes)
White Ethnicity (2 = No, 1 = Yes)
The decision tree is built using the Gini index as the splitting criterion.
The confusion matrix shows that, on the test sample, the model correctly predicts 12,933 people as never having used cannabis and 627 as having used it. But 2,660 people who actually used cannabis were misclassified, as were 767 people who never used it.
The accuracy score is 80%. But this high result is due to the large proportion of people who never used cannabis. Indeed, the prediction for people who have used cannabis is very poor, with most of them being predicted as never having used it.
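A minimal sketch of the workflow described above (the variable names are illustrative and assume a subnesarc frame prepared as in the random forest post; this is not the original notebook code):

from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeClassifier
import sklearn.metrics

predictors = subnesarc[['AGE', 'sex', 'family_income', 'smoked_100cigarettes', 'white']]
target = subnesarc['used_canabis']
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.4)

# Gini impurity is the splitting criterion discussed above
classifier = DecisionTreeClassifier(criterion='gini').fit(pred_train, tar_train)
predictions = classifier.predict(pred_test)

print(sklearn.metrics.confusion_matrix(tar_test, predictions))
print(sklearn.metrics.accuracy_score(tar_test, predictions))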
The decision tree is huge, with lots of nodes, as expected with the Python approach. I used the option filled=True to fill the nodes with a color indicating the majority class.
The first node splits on the criterion AGE <= 55.5. For people younger than the threshold (13,313 having used cannabis and 4,577 never having used it), the second criterion is whether or not they ever smoked 100+ cigarettes; those having smoked 100+ cigarettes are more likely to have ever used cannabis. The third level for that group splits according to sex, and the tree implies a higher chance for males to have ever used cannabis. Those men are then split according to family income, at the fourth income category: the richer ones are less likely to have ever tried cannabis.
I could go on, but I will stop here as a more detailed description does not really bring much.
The main conclusion from the decision tree is a split of behavior depending on age. The smoker/non-smoker split is the second important criterion, and after that comes sex.