Machine Learning Assignment 4
A k-means cluster analysis was conducted to identify underlying subgroups of adolescents based on their similarity of responses on 11 variables that represent characteristics that could have an impact on school connectedness. Clustering variables included two binary variables measuring whether or not the adolescent had ever used alcohol or marijuana, as well as quantitative variables measuring alcohol problems, a scale measuring engaging in deviant behaviors (such as vandalism, other property damage, lying, stealing, running away, driving without permission, selling drugs, and skipping school), and scales measuring violence, depression, self-esteem, parental presence, parental activities, family connectedness, and grade point average. All clustering variables were standardized to have a mean of 0 and a standard deviation of 1.
Data were randomly split into a training set that included 70% of the observations (N=3201) and a test set that included 30% of the observations (N=1701). A series of k-means cluster analyses were conducted on the training data specifying k=1-9 clusters, using Euclidean distance. The variance in the clustering variables that was accounted for by the clusters (r-square) was plotted for each of the nine cluster solutions in an elbow curve to provide guidance for choosing the number of clusters to interpret.
Figure 1. Elbow curve of r-square values for the nine cluster solutions
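For readers working in Python rather than SAS, a minimal sketch of the same elbow-curve idea with scikit-learn is shown below. This is only a sketch: train_std is a hypothetical DataFrame holding the 11 standardized clustering variables, and the r-square is computed as one minus the within-cluster sum of squares over the total sum of squares.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def elbow_rsquare(X, ks):
    # total sum of squared deviations from the overall mean
    total_ss = ((X - X.mean(axis=0)) ** 2).sum().sum()
    rsq = []
    for k in ks:
        km = KMeans(n_clusters=k, n_init=10, random_state=123).fit(X)
        # km.inertia_ is the within-cluster sum of squares
        rsq.append(1 - km.inertia_ / total_ss)
    return rsq

# rsq = elbow_rsquare(train_std, range(1, 10))
# plt.plot(range(1, 10), rsq, marker='o'); plt.xlabel('number of clusters'); plt.ylabel('r-square')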
The elbow curve was inconclusive, suggesting that the 2-, 5-, 6-, 7-, and 8-cluster solutions might all be worth interpreting. The results below are for an interpretation of the 4-cluster solution.
Canonical discriminant analysis was used to reduce the 11 clustering variables down to a few canonical variables that accounted for most of the variance in the clustering variables. A scatterplot of the first two canonical variables by cluster (Figure 2, shown below) indicated that the observations in clusters 2 and 3 were densely packed with relatively low within-cluster variance, and did not overlap very much with the other clusters. Cluster 1 was generally distinct, but the observations had greater spread, suggesting higher within-cluster variance. Observations in cluster 4 were spread out more than the other clusters, showing high within-cluster variance. The results of this plot suggest that the best cluster solution may have fewer than 4 clusters, so it will be especially important to also evaluate the cluster solutions with fewer than 4 clusters.
Figure 2. Plot of the first two canonical variables for the clustering variables by cluster.
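As a rough Python analogue of the canonical discriminant step (a sketch only; train_std and labels are assumed to hold the standardized clustering variables and the k-means cluster assignments), linear discriminant analysis gives a similar two-dimensional view of the clusters:
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def plot_clusters_2d(X, labels):
    # project the clustering variables onto the first two discriminant (canonical) axes
    lda = LinearDiscriminantAnalysis(n_components=2)
    Z = lda.fit_transform(X, labels)
    plt.scatter(Z[:, 0], Z[:, 1], c=labels, s=10)
    plt.xlabel('Canonical variable 1')
    plt.ylabel('Canonical variable 2')
    plt.show()

# plot_clusters_2d(train_std, labels)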
The means on the clustering variables showed that, compared to the other clusters, adolescents in cluster 1 had moderate levels on the clustering variables. They had a relatively high likelihood of having used alcohol or marijuana, and relatively high levels of alcohol problems, violence, and deviant behavior, with moderate depression and lower self-esteem. They also appeared to have fairly low levels of school connectedness, parental presence, GPA, parental involvement in activities, and family connectedness.
Cluster 2 had a moderate likelihood of having used alcohol, a lower likelihood of having used marijuana, fewer alcohol problems, less deviant behavior, higher depression, and lower GPA, parental presence, and family connectedness. Overall, cluster 2 was moderate compared to clusters 1 and 4.
On the other hand, cluster 4 clearly included the most troubled adolescents. Adolescents in cluster 4 had the highest likelihood of having used alcohol, a very high likelihood of having used marijuana, more alcohol problems, and more engagement in deviant and violent behaviors compared to the other clusters. They also had higher levels of depression, the lowest self-esteem, and the lowest levels of GPA, parental presence, parental involvement in activities, and family connectedness.
Cluster 3 appeared to include the least troubled adolescents. Compared to adolescents in the other clusters, they were the least likely to have used alcohol or marijuana, and they had the lowest levels of alcohol problems and deviant and violent behavior. They also had the lowest levels of depression, and the highest self-esteem, grade point average, parental presence, parental involvement in activities, and family connectedness.
In order to externally validate the clusters, an Analysis of Variance (ANOVA) was conducted to test for significant differences between the clusters on school connectedness. A Tukey test was used for post hoc comparisons between the clusters. Results indicated significant differences between the clusters on school connectedness (F(3, 3197)=255.40, p<.0001). The Tukey post hoc comparisons showed significant differences between clusters on school connectedness, with the exception that clusters 1 and 2 were not significantly different from each other. Adolescents in cluster 3 had the highest school connectedness (mean=30.11, sd=4.28), and cluster 4 had the lowest (mean=23.07, sd=5.79).
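The same validation step could be sketched in Python with statsmodels (a sketch only, assuming a DataFrame named merged that, like the SAS data set below, contains schconn1 and the cluster assignment):
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

# one-way ANOVA of school connectedness across the four clusters
anova_model = smf.ols('schconn1 ~ C(cluster)', data=merged).fit()
print(anova_model.summary())

# Tukey HSD post hoc comparisons between clusters
tukey = multi.MultiComparison(merged['schconn1'], merged['cluster']).tukeyhsd()
print(tukey.summary())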
SAS Code
libname mydata "/courses/d1406ae5ba27fe300" access=readonly;
**************************************************************************************************************
DATA MANAGEMENT
**************************************************************************************************************;
data clust;
set mydata.treeaddhealth;
* create a unique identifier to merge cluster assignment variable with
the main data set;
idnum=_n_;
keep idnum alcevr1 marever1 alcprobs1 deviant1 viol1 dep1 esteem1 schconn1
parpres paractv famconct gpa1;
* delete observations with missing data;
if cmiss(of _all_) then delete;
run;
ods graphics on;
* Split data randomly into test and training data;
proc surveyselect data=clust out=traintest seed = 123
samprate=0.7 method=srs outall;
run;
data clus_train;
set traintest;
if selected=1;
run;
data clus_test;
set traintest;
if selected=0;
run;
* standardize the clustering variables to have a mean of 0 and standard deviation of 1;
proc standard data=clus_train out=clustvar mean=0 std=1;
var alcevr1 marever1 alcprobs1 deviant1 viol1 dep1 esteem1 gpa1
parpres paractv famconct;
run;
%macro kmean(K);
proc fastclus data=clustvar out=outdata&K. outstat=cluststat&K. maxclusters= &K. maxiter=300;
var alcevr1 marever1 alcprobs1 deviant1 viol1 dep1 esteem1 gpa1
parpres paractv famconct;
run;
%mend;
%kmean(1);
%kmean(2);
%kmean(3);
%kmean(4);
%kmean(5);
%kmean(6);
%kmean(7);
%kmean(8);
%kmean(9);
* extract r-square values from each cluster solution and then merge them to plot elbow curve;
data clus1;
set cluststat1;
nclust=1;
if _type_='RSQ';
keep nclust over_all;
run;
data clus2;
set cluststat2;
nclust=2;
if _type_='RSQ';
keep nclust over_all;
run;
data clus3;
set cluststat3;
nclust=3;
if _type_='RSQ';
keep nclust over_all;
run;
data clus4;
set cluststat4;
nclust=4;
if _type_='RSQ';
keep nclust over_all;
run;
data clus5;
set cluststat5;
nclust=5;
if _type_='RSQ';
keep nclust over_all;
run;
data clus6;
set cluststat6;
nclust=6;
if _type_='RSQ';
keep nclust over_all;
run;
data clus7;
set cluststat7;
nclust=7;
if _type_='RSQ';
keep nclust over_all;
run;
data clus8;
set cluststat8;
nclust=8;
if _type_='RSQ';
keep nclust over_all;
run;
data clus9;
set cluststat9;
nclust=9;
if _type_='RSQ';
keep nclust over_all;
run;
data clusrsquare;
set clus1 clus2 clus3 clus4 clus5 clus6 clus7 clus8 clus9;
run;
* plot elbow curve using r-square values;
symbol1 color=blue interpol=join;
proc gplot data=clusrsquare;
plot over_all*nclust;
run;
*****************************************************************************************
further examine cluster solution for the number of clusters suggested by the elbow curve
*****************************************************************************************;
* plot clusters for 4 cluster solution;
proc candisc data=outdata4 out=clustcan;
class cluster;
var alcevr1 marever1 alcprobs1 deviant1 viol1 dep1 esteem1 gpa1
parpres paractv famconct;
run;
proc sgplot data=clustcan;
scatter y=can2 x=can1 / group=cluster;
run;
* validate clusters on school connectedness;
* first merge clustering variable and assignment data with school connectedness data;
data schconn_data;
set clus_train;
keep idnum schconn1;
run;
proc sort data=outdata4;
by idnum;
run;
proc sort data=schconn_data;
by idnum;
run;
data merged;
merge outdata4 schconn_data;
by idnum;
run;
proc sort data=merged;
by cluster;
run;
proc means data=merged;
var schconn1;
by cluster;
run;
proc anova data=merged;
class cluster;
model schconn1 = cluster;
means cluster/tukey;
run;
Machine Learning Assignment 3
SAS Code:
libname mydata "/courses/d1406ae5ba27fe300" access=readonly;
**************************************************************************************************************
DATA MANAGEMENT
**************************************************************************************************************;
data new;
set mydata.treeaddhealth;
if bio_sex=1 then male=1;
if bio_sex=2 then male=0;
* delete observations with missing data;
if cmiss(of _all_) then delete;
run;
ods graphics on;
* Split data randomly into test and training data;
proc surveyselect data=new out=traintest seed = 123
samprate=0.7 method=srs outall;
run;
* lasso multiple regression with lars algorithm k=10 fold validation;
proc glmselect data=traintest plots=all seed=123;
partition ROLE=selected(train='1' test='0');
model gpa1 = male hispanic white black namerican asian alcevr1 marever1 cocever1
inhever1 cigavail passist expel1 age alcprobs1 deviant1 viol1 dep1 esteem1 parpres paractv
famconct schconn1/selection=lar(choose=cv stop=none) cvmethod=random(10);
run;
Explanation
A lasso regression analysis was conducted to identify a subset of variables from a pool of 23 categorical and quantitative predictor variables that best predicted a quantitative response variable measuring grade point average in adolescents. Categorical predictors included gender and a series of 5 binary categorical variables for race and ethnicity (Hispanic, White, Black, Native American and Asian). Binary substance use variables were measured with individual questions about whether the adolescent had ever used alcohol, marijuana, cocaine or inhalants. Additional categorical variables included the availability of cigarettes in the home, whether or not either parent was on public assistance, and any experience with being expelled from school. Quantitative predictor variables included age, school connectedness, alcohol problems, and a measure of deviance that included such behaviors as vandalism, other property damage, lying, stealing, running away, driving without permission, selling drugs, and skipping school. Another scale for violence, one for depression, and others measuring self-esteem, parental presence, parental activities, and family connectedness were also included. All predictor variables were standardized to have a mean of zero and a standard deviation of one.
Data were randomly split into a training set that included 70% of the observations (N=3201) and a test set that included 30% of the observations (N=1701). The least angle regression algorithm with k=10 fold cross validation was used to estimate the lasso regression model in the training set, and the model was validated using the test set. The change in the cross validation average (mean) squared error at each step was used to identify the best subset of predictor variables.
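A hedged scikit-learn sketch of the same procedure is shown below (LassoLarsCV implements the least angle regression path with cross validation; X_train, y_train, X_test, and y_test are assumed to hold the standardized predictors and GPA for the training and test observations):
from sklearn.linear_model import LassoLarsCV

# lasso fit along the LARS path, with the penalty chosen by 10-fold cross validation
model = LassoLarsCV(cv=10).fit(X_train, y_train)

# predictors with non-zero coefficients form the selected subset
selected = {name: round(coef, 3) for name, coef in zip(X_train.columns, model.coef_) if coef != 0}
print(selected)
print('test R-square:', model.score(X_test, y_test))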
During the estimation process, school connectedness and violent behavior were most strongly associated with grade point average, followed by parental involvement in activities and being male. Being male and violent behavior were negatively associated with grade point average, while school connectedness and parental involvement in activities were positively associated with it. Other predictors associated with a higher GPA included self-esteem, Asian and White ethnicity, and family connectedness. Predictors associated with a lower GPA included Black and Hispanic ethnicity, age, alcohol use, marijuana use, depression, availability of cigarettes in the home, deviant behavior, having a parent on public assistance, and a history of being expelled from school. These 16 variables were selected as predictors of the grade point average response variable; the model at the step where the Asian ethnicity variable entered had the lowest cross-validation error and was chosen as the best model.
Result
Machine Learning – Assignment 2
Program
LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;
DATA new; set mydata.treeaddhealth;
PROC SORT; BY AID;
PROC HPFOREST;
target marever1/level=nominal;
input BIO_SEX HISPANIC WHITE BLACK NAMERICAN ASIAN alcevr1 cocever1 inhever1
Cigavail PASSIST EXPEL1 /level=nominal;
input age DEVIANT1 VIOL1 DEP1 ESTEEM1 PARPRES PARACTV
FAMCONCT schconn1 GPA1 /level=interval;
RUN;
Results
Explanation
Random forest analysis was performed to evaluate the importance of a series of explanatory variables in predicting a binary, categorical response variable. The following explanatory variables were included as possible contributors to a random forest evaluating marijuana smoking (my response variable): age, gender, race/ethnicity (Hispanic, White, Black, Native American, and Asian), alcohol use, cocaine use, inhalant use, availability of cigarettes in the home, whether or not either parent was on public assistance, any experience with being expelled from school, alcohol problems, deviance, violence, depression, self-esteem, parental presence, parental activities, family connectedness, school connectedness, and grade point average.
The explanatory variables with the highest relative importance scores were alcohol use, deviance, cocaine use, and inhalant use. The accuracy of the random forest was 78.3%, and growing multiple trees rather than a single tree added little to the overall accuracy of the model, suggesting that interpretation of a single decision tree may be appropriate.
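A comparable random forest could be sketched in Python with scikit-learn (a sketch only; X and y below are assumed to hold the explanatory variables and the binary marijuana-use response):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=123)

forest = RandomForestClassifier(n_estimators=25, random_state=123).fit(X_train, y_train)
print('test accuracy:', forest.score(X_test, y_test))

# relative importance of each explanatory variable, largest first
for name, importance in sorted(zip(X.columns, forest.feature_importances_), key=lambda t: -t[1]):
    print(name, round(importance, 3))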
Machine Learning - Assignment 1
Program
LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;
DATA new; set mydata.treeaddhealth;
PROC SORT; BY AID;
ods graphics on;
proc hpsplit seed=15531;
class marever1 BIO_SEX HISPANIC WHITE BLACK NAMERICAN ASIAN
alcevr1 cocever1 inhever1 Cigavail EXPEL1 ;
model marever1 = AGE BIO_SEX HISPANIC WHITE BLACK NAMERICAN ASIAN alcevr1 ALCPROBS1
cocever1 inhever1 DEVIANT1 VIOL1 DEP1 ESTEEM1 PARPRES PARACTV
FAMCONCT schconn1 Cigavail PASSIST EXPEL1 GPA1;
grow entropy;
prune costcomplexity;
RUN;
Results
Explanation:
Decision tree analysis was performed to test nonlinear relationships among a series of explanatory variables and a binary, categorical response variable. All possible separations (categorical) or cut points (quantitative) are tested. For the present analyses, the entropy “goodness of split” criterion was used to grow the tree and a cost complexity algorithm was used for pruning the full tree into a final subtree.
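The equivalent idea in scikit-learn might look like the sketch below (assuming X_train and y_train hold the predictors and the binary marijuana-use response; the entropy criterion grows the tree and ccp_alpha controls cost-complexity pruning, with a middle alpha of the pruning path used purely for illustration):
from sklearn.tree import DecisionTreeClassifier

# compute the cost-complexity pruning path for a fully grown entropy tree
path = DecisionTreeClassifier(criterion='entropy', random_state=15531).cost_complexity_pruning_path(X_train, y_train)

# refit the tree pruned at one alpha from the path (a middle value, for illustration only)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
tree = DecisionTreeClassifier(criterion='entropy', ccp_alpha=alpha, random_state=15531).fit(X_train, y_train)
print('number of leaves:', tree.get_n_leaves())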
The following explanatory variables were included as possible contributors to a classification tree model evaluating marijuana smoking (my response variable): age, gender, race/ethnicity (Hispanic, White, Black, Native American, and Asian), alcohol use, cocaine use, inhalant use, availability of cigarettes in the home, whether or not either parent was on public assistance, any experience with being expelled from school, alcohol problems, deviance, violence, depression, self-esteem, parental presence, parental activities, family connectedness, school connectedness, and grade point average.
There were a total of 6504 rows, but only 4576 were used. The final tree has a total of 11 nodes and 6 leaves.
Alcohol use was the first variable to separate the sample into two subgroups, with roughly similar percentages of alcohol users and non-users.
The adolescents with no alcohol use were split into two groups: cocaine users and non-cocaine users. Among the non-cocaine users with no alcohol use there were 106 marijuana users; among the cocaine users with no alcohol use there were 21 marijuana users.
The adolescents with alcohol use were split on a deviance score of 4.05 (>= 4.05 vs. < 4.05). Those with a deviance score >= 4.05 were further divided into cocaine users and non-users. Among the cocaine users with alcohol use and a deviance score >= 4.05 there were 89 marijuana users; among the cocaine non-users with alcohol use and a deviance score >= 4.05 there were 404 marijuana users, the highest number of marijuana users in any leaf.
The adolescents with a deviance score < 4.05 were divided on an alcohol problems score of .06 (< .06 vs. >= .06). Among alcohol users with a deviance score < 4.05 and an alcohol problems score < .06 there were 253 marijuana users; among those with an alcohol problems score >= .06 there were 227 marijuana users.
The total model correctly classifies 53% of those who have used marijuana (sensitivity) and 93% of those who have not (specificity), so we can better predict non-users of marijuana than users.
The highest number of marijuana users (404) were alcohol users with a deviance score >= 4.05 and no cocaine use.
The next highest (253) were alcohol users with a deviance score < 4.05 and an alcohol problems score < .06.
Next (227) were alcohol users with a deviance score < 4.05 and an alcohol problems score >= .06.
The lowest (21) were adolescents with no alcohol use who were cocaine users.
Regression Models – Assignment 4
The research question is whether alcohol dependence is related to major depression and panic disorder.
Result 1 – Logistic regression for alcohol dependence and abuse with major depression
Here we have generated the logistic regression model: the response variable is alcohol dependence and the explanatory variable is major depression, both binary variables coded 0 or 1. There are a total of 43,093 observations. The likelihood ratio p-value is extremely low, and the p-value for MAJORDEPLIFE is also very low, so the association is statistically significant.
# logistic regression with major depression
lreg1 = smf.logit(formula = 'alcdep ~ MAJORDEPLIFE', data = mydata).fit()
print (lreg1.summary())
Logit Regression Results
==============================================================================
Dep. Variable: alcdep No. Observations: 43093
Model: Logit Df Residuals: 43091
Method: MLE Df Model: 1
Date: Sat, 13 Feb 2021 Pseudo R-squ.: 0.01817
Time: 20:19:32 Log-Likelihood: -24878.
converged: True LL-Null: -25339.
Covariance Type: nonrobust LLR p-value: 2.893e-202
================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept -1.1360 0.012 -91.486 0.000 -1.160 -1.112
MAJORDEPLIFE 0.8035 0.026 30.845 0.000 0.752 0.855
================================================================================
Result2 – Odds Ratios
The odds ratio is the odds of an event occurring in one group compared to the odds of it occurring in another group. Here the odds ratio is 2.23, which means that people with major depression are 2.23 times more likely to have alcohol dependence compared to people without major depression.
# odds ratios
print ("Odds Ratios")
print (numpy.exp(lreg1.params))
Odds Ratios
Intercept 0.32
MAJORDEPLIFE 2.23
Result 3 – Confidence Intervals For Odds ratio
The confidence interval for the odds ratio is 2.12 to 2.35, which means that people with major depression are between 2.12 and 2.35 times more likely to have alcohol dependence, compared to people without major depression. The odds ratio is a sample statistic, and the confidence interval is an estimate of the population parameter.
#%%
# odd ratios with 95% confidence intervals
params = lreg1.params
conf = lreg1.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print (numpy.exp(conf))
Lower CI Upper CI OR
Intercept 0.31 0.33 0.32
MAJORDEPLIFE 2.12 2.35 2.23
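These odds ratios are simply the exponentiated coefficient and confidence limits from the logit output in Result 1; a quick arithmetic check using the values printed above:
import math
print(round(math.exp(0.8035), 2))   # odds ratio for MAJORDEPLIFE -> 2.23
print(round(math.exp(0.752), 2))    # lower 95% limit -> 2.12
print(round(math.exp(0.855), 2))    # upper 95% limit -> 2.35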
Result4 – Regression with major depression + panic disorder
Both major depression and panic disorder are positively associated with the likelihood of alcohol dependence. The odds ratio for major depression is 2.08, which means that people with major depression are 2.08 times more likely to be alcohol dependent than people without major depression, after controlling for panic disorder. The odds ratio for panic disorder is 1.61, which means that people with panic disorder are 1.61 times more likely to be alcohol dependent than people without panic disorder, after controlling for major depression.
Because the confidence intervals on our odds ratios overlap, we cannot say that major depression is more strongly associated with alcohol dependence than panic disorder.
For the population, we can say that those with major depression are anywhere between 1.97 and 2.19 times more likely to have alcohol dependence than those without major depression, and those with panic disorder are between 1.48 and 1.77 times more likely to have alcohol dependence than those without panic disorder. There was no evidence that panic disorder confounded the association between major depression and alcohol dependence.
# logistic regression with major depression + panic disorder
lreg2 = smf.logit(formula = 'alcdep ~ MAJORDEPLIFE + panic', data = mydata).fit()
print (lreg2.summary())
#%%
# odd ratios with 95% confidence intervals
params = lreg2.params
conf = lreg2.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print (numpy.exp(conf))
Logit Regression Results
==============================================================================
Dep. Variable: alcdep No. Observations: 43093
Model: Logit Df Residuals: 43090
Method: MLE Df Model: 2
Date: Sat, 13 Feb 2021 Pseudo R-squ.: 0.02030
Time: 21:05:23 Log-Likelihood: -24824.
converged: True LL-Null: -25339.
Covariance Type: nonrobust LLR p-value: 4.377e-224
================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept -1.1500 0.013 -91.920 0.000 -1.174 -1.125
MAJORDEPLIFE 0.7321 0.027 27.113 0.000 0.679 0.785
panic 0.4790 0.046 10.502 0.000 0.390 0.568
Lower CI Upper CI OR
Intercept 0.31 0.32 0.32
MAJORDEPLIFE 1.97 2.19 2.08
panic 1.48 1.77 1.61
Python Code:
# -*- coding: utf-8 -*-
"""
Created on Sat Feb 13 19:07:34 2021
@author: GB8PM0
"""
import numpy
import pandas
import statsmodels.api as sm
import seaborn
import statsmodels.formula.api as smf
# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x:'%.2f'%x)
mydata = pandas.read_csv('nesarc_pds.csv', low_memory=False)
##############################################################################
# DATA MANAGEMENT
##############################################################################
#setting variables you will be working with to numeric
mydata['ALCABDEP12DX'] =pandas.to_numeric(mydata['ALCABDEP12DX'], errors='coerce')
mydata['ALCABDEPP12DX'] = pandas.to_numeric(mydata['ALCABDEPP12DX'], errors='coerce')
mydata['MAJORDEPLIFE'] = pandas.to_numeric(mydata['MAJORDEPLIFE'], errors='coerce')
mydata['PANLIFE'] = pandas.to_numeric(mydata['PANLIFE'], errors='coerce')
mydata['APANLIFE'] = pandas.to_numeric(mydata['APANLIFE'], errors='coerce')
mydata['SOCPDLIFE'] = pandas.to_numeric(mydata['SOCPDLIFE'], errors='coerce')
recode2 = {0: 0, 1: 1, 2: 1, 3: 1,}
# alcohol abuse/dependence in last 12 months
mydata['alc12mth'] = mydata['ALCABDEP12DX'].map(recode2)
chk1 = mydata['alc12mth'].value_counts(sort=False, dropna=False)
print (chk1)
# alcohol abuse/dependence prior to last 12 months
mydata['alcpr12mth'] = mydata['ALCABDEPP12DX'].map(recode2)
chk2= mydata['alcpr12mth'].value_counts(sort=False, dropna=False)
print (chk2)
# alcohol abuse/dependence
mydata['alctot'] = mydata['alcpr12mth'] + mydata['alc12mth']
# binary alcohol abuse/dependence
def alcdep (row):
    if row['alctot'] == 0:
        return 0
    else:
        return 1
mydata['alcdep']= mydata.apply (lambda row: alcdep(row), axis=1 )
chk3= mydata['alcdep'].value_counts(sort=False, dropna=False)
print (chk3)
# panic disorder
mydata['pantot'] = mydata['PANLIFE'] + mydata['APANLIFE']
# binary panic disorder
def panic (row):
    if row['pantot'] == 0:
        return 0
    else:
        return 1
mydata['panic']= mydata.apply (lambda row: panic(row), axis=1 )
chk4= mydata['panic'].value_counts(sort=False, dropna=False)
print (chk4)
#%%
##############################################################################
# END DATA MANAGEMENT
##############################################################################
##############################################################################
# LOGISTIC REGRESSION
##############################################################################
# logistic regression with major depression
lreg1 = smf.logit(formula = 'alcdep ~ MAJORDEPLIFE', data = mydata).fit()
print (lreg1.summary())
#%%
# odds ratios
print ("Odds Ratios")
print (numpy.exp(lreg1.params))
#%%
# odd ratios with 95% confidence intervals
params = lreg1.params
conf = lreg1.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print (numpy.exp(conf))
#%%
# logistic regression with major depression + panic disorder
lreg2 = smf.logit(formula = 'alcdep ~ MAJORDEPLIFE + panic', data = mydata).fit()
print (lreg2.summary())
#%%
# odd ratios with 95% confidence intervals
params = lreg2.params
conf = lreg2.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print (numpy.exp(conf))
Regression Models – Assignment 3
The relationship between breast cancer rate and urbanrate is being studied.
Result 1- OLS regression model between breast cancer and Urban rate
The OLS regression model shows that the breast cancer rate increases with the urban rate; the fitted line is breastcancerrate = 37.52 + 0.611 * urbanrate_c, where urbanrate_c is the urban rate centered at its mean. The r-squared of 0.354 indicates that the linear association captures 35.4% of the variability in breast cancer rate.
print ("after centering OLS regression model for the association between urban rate and breastcancer rate ")
reg1 = smf.ols('breastcancerrate ~ urbanrate_c', data=sub1).fit()
print (reg1.summary())
after centering OLS regression model for the association between urban rate and breastcancer rate
OLS Regression Results
==============================================================================
Dep. Variable: breastcancerrate R-squared: 0.354
Model: OLS Adj. R-squared: 0.350
Method: Least Squares F-statistic: 88.32
Date: Thu, 11 Feb 2021 Prob (F-statistic): 5.36e-17
Time: 15:13:54 Log-Likelihood: -706.75
No. Observations: 163 AIC: 1417.
Df Residuals: 161 BIC: 1424.
Df Model: 1
Covariance Type: nonrobust
===============================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------
Intercept 37.5153 1.457 25.752 0.000 34.638 40.392
urbanrate_c 0.6107 0.065 9.398 0.000 0.482 0.739
==============================================================================
Omnibus: 5.949 Durbin-Watson: 1.705
Prob(Omnibus): 0.051 Jarque-Bera (JB): 5.488
Skew: 0.385 Prob(JB): 0.0643
Kurtosis: 2.536 Cond. No. 22.4
==============================================================================
Result 2 - OLS regression model after adding more explanatory variables
Now I added the potential confounders female employment rate, CO2 emissions, and alcohol consumption to the model. CO2 emissions and alcohol consumption both have low p-values (<.05), so we can say that both are positively associated with breast cancer rate. The urban rate coefficient of 0.49 means that for each one-point increase in urban rate the breast cancer rate is about 0.49 higher; for the population, there is a 95% chance that this increase lies between 0.36 and 0.62.
The female employment rate p-value is 0.775, which is not significant, so we cannot reject the null hypothesis of no association between female employment rate and breast cancer rate. Its 95% confidence interval is -0.164 to 0.220; since this interval includes 0, the association could be zero in the population.
print ("after centering OLS regression model for the association between famele employed, co2emissions, alcohol consumption,urban rate and breastcancer rate ")
reg2= smf.ols('breastcancerrate ~ urbanrate_c + femaleemployrate_c + co2emissions_c + alcconsumption_c', data=sub1).fit()
print (reg2.summary())
after centering OLS regression model for the association between female employed, co2emissions, alcohol consumption,urban rate and breastcancer rate
OLS Regression Results
==============================================================================
Dep. Variable: breastcancerrate R-squared: 0.495
Model: OLS Adj. R-squared: 0.482
Method: Least Squares F-statistic: 38.74
Date: Thu, 11 Feb 2021 Prob (F-statistic): 1.43e-22
Time: 15:13:59 Log-Likelihood: -686.69
No. Observations: 163 AIC: 1383.
Df Residuals: 158 BIC: 1399.
Df Model: 4
Covariance Type: nonrobust
======================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------
Intercept 37.5153 1.300 28.852 0.000 34.947 40.083
urbanrate_c 0.4925 0.066 7.413 0.000 0.361 0.624
femaleemployrate_c 0.0278 0.097 0.286 0.775 -0.164 0.220
co2emissions_c 1.389e-10 4.68e-11 2.971 0.003 4.66e-11 2.31e-10
alcconsumption_c 1.5061 0.282 5.349 0.000 0.950 2.062
==============================================================================
Omnibus: 2.965 Durbin-Watson: 1.832
Prob(Omnibus): 0.227 Jarque-Bera (JB): 2.942
Skew: 0.324 Prob(JB): 0.230
Kurtosis: 2.883 Cond. No. 2.83e+10
Result 3 – Scatter plot for linear and Quadratic model
Here I created one scatter plot with a linear fit and another with a quadratic fit to see if the relationship is curvilinear. It may be quadratic.
# plot a scatter plot for linear
scat1 = seaborn.regplot(x="urbanrate", y="breastcancerrate", fit_reg=True, data=sub1)
plt.xlabel('urbanrate')
plt.ylabel('breastcancerrate')
plt.title('Scatterplot for the Association Between urbanrate Rate and breastcancerrate')
# plot a scatter plot# add the quadratic portion to see if it is curvilinear
scat1 = seaborn.regplot(x="urbanrate", y="breastcancerrate", order=2, fit_reg=True, data=sub1)
plt.xlabel('urbanrate')
plt.ylabel('breastcancerrate')
plt.title('Scatterplot for the Association Between urbanrate Rate and breastcancerrate')
Result 4 - OLS regression model with quadratic
Here the R-squared has improved from 0.354 to 0.371, a slightly higher value.
print ("after centering OLS regression model for the association between urban rate and breastcancer rate and adding qudratic portion")
reg3= smf.ols('breastcancerrate ~ urbanrate_c + I(urbanrate_c**2)', data=sub1).fit()
print (reg3.summary())
after centering OLS regression model for the association between urban rate and breastcancer rate and adding quadratic portion
OLS Regression Results
==============================================================================
Dep. Variable: breastcancerrate R-squared: 0.371
Model: OLS Adj. R-squared: 0.363
Method: Least Squares F-statistic: 47.13
Date: Fri, 12 Feb 2021 Prob (F-statistic): 8.06e-17
Time: 07:52:49 Log-Likelihood: -704.64
No. Observations: 163 AIC: 1415.
Df Residuals: 160 BIC: 1425.
Df Model: 2
Covariance Type: nonrobust
=======================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------
Intercept 34.6381 2.014 17.201 0.000 30.661 38.615
urbanrate_c 0.6250 0.065 9.656 0.000 0.497 0.753
I(urbanrate_c ** 2) 0.0057 0.003 2.048 0.042 0.000 0.011
==============================================================================
Omnibus: 3.684 Durbin-Watson: 1.705
Prob(Omnibus): 0.159 Jarque-Bera (JB): 3.734
Skew: 0.345 Prob(JB): 0.155
Kurtosis: 2.728 Cond. No. 1.01e+03
Result 5 - OLS regression model with quadratic and another variable
Here I have added the second-order (squared) urban rate term and alcohol consumption to the model. The p-value for alcohol consumption is significant, meaning there is an association between alcohol consumption and breast cancer rate. The R-squared has increased to 0.498 (49.8%), slightly improved from the previous model.
print ("after centering OLS regression model for the association between urban rate and breastcancer rate after adding alcohol consumption rate ")
reg4 = smf.ols('breastcancerrate ~ urbanrate_c + I(urbanrate_c**2) + alcconsumption_c ', data=sub1).fit()
print (reg4.summary())
after centering OLS regression model for the association between urban rate and breastcancer rate after adding alcohol consumption rate
OLS Regression Results
Dep. Variable: breastcancerrate R-squared: 0.498
Model: OLS Adj. R-squared: 0.488
Method: Least Squares F-statistic: 52.51
Date: Fri, 12 Feb 2021 Prob (F-statistic): 1.21e-23
Time: 08:49:00 Log-Likelihood: -686.27
No. Observations: 163 AIC: 1381.
Df Residuals: 159 BIC: 1393.
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 33.4837 1.814 18.459 0.000 29.901 37.066
urbanrate_c 0.5203 0.060 8.627 0.000 0.401 0.639
I(urbanrate_c ** 2) 0.0080 0.003 3.169 0.002 0.003 0.013
alcconsumption_c 1.7136 0.270 6.340 0.000 1.180 2.247
Omnibus: 2.069 Durbin-Watson: 1.691
Prob(Omnibus): 0.355 Jarque-Bera (JB): 1.645
Skew: 0.216 Prob(JB): 0.439
Kurtosis: 3.235 Cond. No. 1.01e+03
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.01e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Result 6 - QQPlot
In this regression model, the residual is the difference between the predicted breast cancer rate and the actual observed breast cancer rate for each country. Here I have plotted the residuals. The q-q plot for our regression model shows that the residuals generally follow a straight line but deviate at the lower and higher quantiles, indicating that the residuals do not follow a perfect normal distribution. This could mean that the curvilinear association we observed in our scatter plot is not fully captured by the quadratic urban rate term. There might be other explanatory variables we could consider including in the model to improve estimation of the observed curvilinearity.
#Q-Q plot for normality
fig4=sm.qqplot(reg4.resid, line='r')
Result 7 – Plot of Residuals
Here there is 1 point below -3 standard deviations of the residuals, 3 points (1.84%) above 2.5 standard deviations, and 8 points (4.9%) above 2 standard deviations. Having 1.84% of points beyond 2.5 and 4.9% beyond 2.0 standard deviations suggests that the fit of the model is relatively poor and could be improved; maybe there are other explanatory variables that could be included in the model.
# simple plot of residuals
stdres=pandas.DataFrame(reg4.resid_pearson)
plt.plot(stdres, 'o', ls='None')
l = plt.axhline(y=0, color='r')
plt.ylabel('Standardized Residual')
plt.xlabel('Observation Number')
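A small sketch of how the counts quoted above can be checked from the same standardized residuals (using the reg4 model already fit):
import pandas
# count and percentage of standardized residuals beyond common cutoffs (absolute value)
stdres = pandas.Series(reg4.resid_pearson)
for cutoff in (2.0, 2.5, 3.0):
    n = int((stdres.abs() > cutoff).sum())
    print(cutoff, n, round(100.0 * n / len(stdres), 2), '%')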
Result 8 – Residual Plot and Partial Regression Plot
The plot in the upper right hand corner shows the residuals for each observation at different values of alcohol consumption. The absolute values of the residuals appear to be higher at lower alcohol consumption rates, get smaller (closer to 0) as alcohol consumption increases, and then increase again at higher levels of alcohol consumption. This model does not predict breast cancer rate as well for countries with either high or low levels of alcohol consumption. Maybe there is a curvilinear relationship between alcohol consumption rate and breast cancer rate.
The partial regression plot is in the lower left hand corner. The residuals are spread out in a random pattern around the partial regression line, and many of the residuals are pretty far from this line, indicating a great deal of breast cancer rate prediction error. Although alcohol consumption rate shows a statistically significant association with breast cancer rate, this association is pretty weak after controlling for urban rate.
# additional regression diagnostic plots
fig2 = plt.figure(figsize=(12,8))
fig2 = sm.graphics.plot_regress_exog(reg4, "alcconsumption_c", fig=fig2)
Result 9 – Leverage Plot
Here in the leverage plot there are a few outliers on the left side with standardized residuals greater than 2, but their leverage is close to 0, meaning they do not have much influence on the model. There is one observation on the right (number 199) with high leverage that does stand out.
# leverage plot
fig3=sm.graphics.influence_plot(reg4, size=8)
print(fig3)
Complete Python Code
# -*- coding: utf-8 -*-
"""
Created on Mon Feb 8 20:37:33 2021
@author: GB8PM0
"""
#%%
import pandas
import numpy
import seaborn
import scipy
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import statsmodels.api as sm
data = pandas.read_csv(r'C:\Training\Data Analysis\gapminder.csv', low_memory=False)
#setting variables you will be working with to numeric
data['breastcancerrate'] = pandas.to_numeric(data['breastcancerper100th'], errors='coerce')
data['urbanrate'] = pandas.to_numeric(data['urbanrate'], errors='coerce')
data['femaleemployrate'] = pandas.to_numeric(data['femaleemployrate'], errors='coerce')
data['co2emissions'] = pandas.to_numeric(data['co2emissions'], errors='coerce')
data['alcconsumption'] = pandas.to_numeric(data['alcconsumption'], errors='coerce')
# listwise deletion of missing values
sub1 = data[['urbanrate', 'breastcancerrate','femaleemployrate', 'co2emissions','alcconsumption' ]].dropna()
# center the explanatory variable urbanrate by subtracting mean
sub1['urbanrate_c'] = (sub1['urbanrate'] - sub1['urbanrate'].mean())
sub1['femaleemployrate_c'] = (sub1['femaleemployrate'] - sub1['femaleemployrate'].mean())
sub1['co2emissions_c'] = (sub1['co2emissions'] - sub1['co2emissions'].mean())
sub1['alcconsumption_c'] = (sub1['alcconsumption'] - sub1['alcconsumption'].mean())
print ("after centering OLS regression model for the association between urban rate and breastcancer rate ")
reg1 = smf.ols('breastcancerrate ~ urbanrate_c', data=sub1).fit()
print (reg1.summary())
#%%
#after adding other variables
print ("after centering OLS regression model for the association between famele employed, co2emissions, alcohol consumption,urban rate and breastcancer rate ")
reg2= smf.ols('breastcancerrate ~ urbanrate_c + femaleemployrate_c + co2emissions_c + alcconsumption_c', data=sub1).fit()
print (reg2.summary())
#%%
# plot a scatter plot for linear
scat1 = seaborn.regplot(x="urbanrate", y="breastcancerrate", fit_reg=True, data=sub1)
plt.xlabel('urbanrate')
plt.ylabel('breastcancerrate')
plt.title('Scatterplot for the Association Between urbanrate Rate and breastcancerrate')
# plot a scatter plot# add the quadratic portion to see if it is curvilinear
scat1 = seaborn.regplot(x="urbanrate", y="breastcancerrate", order=2, fit_reg=True, data=sub1)
plt.xlabel('urbanrate')
plt.ylabel('breastcancerrate')
plt.title('Scatterplot for the Association Between urbanrate Rate and breastcancerrate')
#%%
# add the quadratic portion to see if it is curvilinear
print ("after centering OLS regression model for the association between urban rate and breastcancer rate and adding qudratic portion")
reg3= smf.ols('breastcancerrate ~ urbanrate_c + I(urbanrate_c**2)', data=sub1).fit()
print (reg3.summary())
#%%
print ("after centering OLS regression model for the association between urban rate and breastcancer rate after adding alcohol consumption rate ")
reg4 = smf.ols('breastcancerrate ~ urbanrate_c + I(urbanrate_c**2) + alcconsumption_c ', data=sub1).fit()
print (reg4.summary())
#%%
#Q-Q plot for normality
fig4=sm.qqplot(reg4.resid, line='r')
#%%
# simple plot of residuals
stdres=pandas.DataFrame(reg4.resid_pearson)
plt.plot(stdres, 'o', ls='None')
l = plt.axhline(y=0, color='r')
plt.ylabel('Standardized Residual')
plt.xlabel('Observation Number')
#%%
# additional regression diagnostic plots
fig2 = plt.figure(figsize=(12,8))
fig2 = sm.graphics.plot_regress_exog(reg4, "alcconsumption_c", fig=fig2)
#%%
# leverage plot
fig3=sm.graphics.influence_plot(reg4, size=8)
print(fig3)
Regression Models – Assignment 2
Relation between breast cancer rate and urban rate.
Python Code:
# -*- coding: utf-8 -*-
"""
Created on Mon Feb 8 20:37:33 2021
@author: GB8PM0
"""
#%%
import pandas
import numpy
import seaborn
import scipy
import matplotlib.pyplot as plt
import statsmodels.api
import statsmodels.formula.api as smf
data = pandas.read_csv(r'C:\Training\Data Analysis\gapminder.csv', low_memory=False)
#setting variables you will be working with to numeric
data['breastcancerrate'] = pandas.to_numeric(data['breastcancerper100th'], errors='coerce')
data['urbanrate'] = pandas.to_numeric(data['urbanrate'], errors='coerce')
# listwise deletion of missing values
sub1 = data[['urbanrate', 'breastcancerrate']].dropna()
# plot a scatter plot
scat1 = seaborn.regplot(x="urbanrate", y="breastcancerrate", fit_reg=True, data=sub1)
plt.xlabel('urbanrate')
plt.ylabel('breastcancerrate')
plt.title('Scatterplot for the Association Between urbanrate Rate and breastcancerrate')
#regression model between urbanrate and breast cancer rate ist variable response
print ("before centering OLS regression model for the association between urban rate and breastc ancer rate")
reg1 = smf.ols('breastcancerrate ~ urbanrate', data=sub1).fit()
print (reg1.summary())
# center the explanatory variable urbanrate by subtracting mean
sub1['newurbanrate'] = (sub1['urbanrate'] - sub1['urbanrate'].mean())
sub1 = sub1[['urbanrate', 'newurbanrate', 'breastcancerrate']]
print(sub1.head(n=10))
print('new mean')
ds2 = sub1['newurbanrate'].mean()
print (ds2)
print ("after centering OLS regression model for the association between urban rate and breastcancer rate ")
reg2 = smf.ols('breastcancerrate ~ newurbanrate', data=sub1).fit()
print (reg2.summary())
Result:
OLS Regression Results
==============================================================================
Dep. Variable: breastcancerrate R-squared: 0.325
Model: OLS Adj. R-squared: 0.321
Method: Least Squares F-statistic: 82.00
Date: Tue, 09 Feb 2021 Prob (F-statistic): 3.12e-16
Time: 08:44:27 Log-Likelihood: -746.80
No. Observations: 172 AIC: 1498.
Df Residuals: 170 BIC: 1504.
Df Model: 1
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept 37.2808 1.426 26.139 0.000 34.465 40.096
newurbanrate 0.5616 0.062 9.055 0.000 0.439 0.684
==============================================================================
Omnibus: 7.683 Durbin-Watson: 1.683
Prob(Omnibus): 0.021 Jarque-Bera (JB): 7.804
Skew: 0.489 Prob(JB): 0.0202
Kurtosis: 2.636 Cond. No. 23.0
Scatterplot
Explanation
The explanatory variable urbanrate is centered by subtracting the mean. The response variable is breast cancer rate. There were a total of 172 observations in the data. The F-statistic is 82.0, and the p-value of 3.12e-16 is very small, considerably less than our alpha level of .05, which tells us that we can reject the null hypothesis and conclude that urban rate is significantly associated with breast cancer rate.
The coefficient for the centered urban rate is 0.562 and the intercept is 37.28, so the estimated model equation is breastcancerrate = 37.28 + 0.562 * newurbanrate. The P>|t| column gives the p-value for the explanatory variable's association with the response variable, which here is 0.000.
The R-squared value is 0.325. It is the proportion of the variance in the response variable that can be explained by the explanatory variable, so this model accounts for about 32.5% of the variability we see in our response variable, breast cancer rate. The breast cancer rate increases with urban rate.
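As a quick illustration of how the estimated equation is used (coefficients taken from the output above), a country at the mean urban rate (centered value 0) is predicted to have a breast cancer rate of about 37.28, and a country 10 points above the mean about 42.9:
# predicted breast cancer rate from the fitted equation, using the centered urban rate
intercept, slope = 37.2808, 0.5616
print(intercept + slope * 0)    # at the mean urban rate -> 37.28
print(intercept + slope * 10)   # 10 points above the mean urban rate -> about 42.9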
Regression Models – Assignment 1
Sample:
Gapminder was started in 2005. Gapminder has data for 192 UN member countries and for an additional 24 other areas, with a total of 215 areas. It has about 400 indicators of global development, including income per person, alcohol consumption, total employment rate, estimated HIV prevalence, breast cancer rate, female employment rate, urban rate, etc.
Procedures:
Gapminder data comes from multiple sources – World Bank, Institute for Health Metrics and Evaluation, US Census Bureau’s International Database, United Nations Statistics division. Sometimes they conduct survey of people from various countries.
Measures:
There are several measures in the gapminder data set. The urban rate is the percentage of the population living in urban areas; this data was collected from the UN in 2008. The breast cancer rate, collected in 2002, is the number of new cases per 100,000 females; this data came from the International Agency for Research on Cancer. The female employment rate is the percentage of the female population that is employed.
I tried to research whether breast cancer rates increase with the urban rate: do urban areas have higher breast cancer rates than areas that are less urban? Then I thought that the variable femaleemployrate might also affect breastcancerrate, so I binned the female employment rate into three categories: low, medium, and high.
Data Analysis Tools - Assignment 4
Here I will be using the gapminder data, comparing breast cancer rates with respect to the urban rate. I will use femaleemployrate as a moderator to determine whether a country's female employment rate has any bearing on the relationship between urban rate and breast cancer rate. I will categorize the female employment rate into 3 categories: 1=low, 2=medium, and 3=high.
Python Code
# -*- coding: utf-8 -*-
"""
Created on Wed Feb 3 15:33:50 2021
@author: GB8PM0
"""
#%%
import pandas
import numpy
import seaborn
import scipy
import matplotlib.pyplot as plt
data = pandas.read_csv('gapminder.csv', low_memory=False)
#setting variables you will be working with to numeric
data['breastcancerrate'] = pandas.to_numeric(data['breastcancerper100th'], errors='coerce')
data['femaleemployrate'] = pandas.to_numeric(data['femaleemployrate'], errors='coerce')
data['urbanrate'] = pandas.to_numeric(data['urbanrate'], errors='coerce')
data['breastcancerrate']=data['breastcancerrate'].replace(' ', numpy.nan)
data['femaleemployrate']=data['femaleemployrate'].replace(' ', numpy.nan)
data_clean=data.dropna()
scat1 = seaborn.regplot(x="urbanrate", y="breastcancerrate", fit_reg=True, data=data)
plt.xlabel('urbanrate')
plt.ylabel('breastcancerrate')
plt.title('Scatterplot for the Association Between urbanrate Rate and breastcancerrate')
print ('association between urbanrate and breastcancer rate')
print (scipy.stats.pearsonr(data_clean['urbanrate'], data_clean['breastcancerrate']))
#%%
#creating three categories for femaleemployrate
def femaleemploycatg (row):
    if row['femaleemployrate'] <= 28:
        return 1
    elif row['femaleemployrate'] > 28 and row['femaleemployrate'] <= 55:
        return 2
    elif row['femaleemployrate'] > 55:
        return 3
data_clean['femaleemploycatg'] = data_clean.apply (lambda row: femaleemploycatg (row),axis=1)
chk1 = data_clean['femaleemploycatg'].value_counts(sort=False, dropna=False)
print(chk1)
sub1=data_clean[(data_clean['femaleemploycatg']== 1)]
sub2=data_clean[(data_clean['femaleemploycatg']== 2)]
sub3=data_clean[(data_clean['femaleemploycatg']== 3)]
print ('association between urbanrate and breastcancerrate for low female employ rate countries')
print (scipy.stats.pearsonr(sub1['urbanrate'], sub1['breastcancerrate']))
print (' ')
scat1 = seaborn.regplot(x="urbanrate", y="breastcancerrate", data=sub1)
plt.xlabel('Urban Rate')
plt.ylabel('breastcancerrate')
plt.title('Scatterplot Urban Rate and breastcancerrate for low female employ rate countries')
print (scat1)
#%%
print ('association between urbanrate and breastcancerrate for medium female employ rate countrie')
print (scipy.stats.pearsonr(sub2['urbanrate'], sub2['breastcancerrate']))
print (' ')
scat3 = seaborn.regplot(x="urbanrate", y="breastcancerrate", data=sub2)
plt.xlabel('Urban Rate')
plt.ylabel('breastcancerrate')
plt.title('Scatterplot Urban Rate and breastcancerrate for medium female employ rate countries')
print (scat3)
#%%
print ('association between urbanrate and breastcancerrate for high female employ rate countrie')
print (scipy.stats.pearsonr(sub3['urbanrate'], sub3['breastcancerrate']))
print (' ')
scat3 = seaborn.regplot(x="urbanrate", y="breastcancerrate", data=sub3)
plt.xlabel('Urban Rate')
plt.ylabel('breastcancerrate')
plt.title('Scatterplot Urban Rate and breastcancerrate for high female employ rate countries')
print (scat3)
Results:
runfile('C:/Training/Data Analysis/Assignment4.py', wdir='C:/Training/Data Analysis')
association between urbanrate and breastcancer rate
(0.5818686321241642, 1.6316781907186673e-16)
1 17
2 105
3 45
Name: femaleemploycatg, dtype: int64
association between urbanrate and breastcancerrate for low female employ rate countries
(-0.05216618996434125, 0.8423894883651072)
association between urbanrate and breastcancerrate for medium female employ rate countrie
(0.5367838916713512, 3.5759994882098993e-09)
association between urbanrate and breastcancerrate for high female employ rate countrie
(0.7672473708543266, 7.877009591193359e-10)
Summary
Here is the Pearson correlation between urban rate and breast cancer rate using all of the data. With r=.58 and p=1.63e-16, there is a moderate positive relationship between breast cancer rate and urban rate, and the small p-value shows that this is very unlikely to have happened by chance alone. The graph shows that countries with a higher urban rate tend to have higher breast cancer rates.
runcell(1, 'C:/Training/Data Analysis/Assignment4.py')
association between urbanrate and breastcancer rate
(0.5818686321241642, 1.6316781907186673e-16)
Next, I want to see whether the association between urban rate and breast cancer rate depends on a third variable, the female employment rate.
I categorized the female employ rate into 3 categories – 1 = less than 28, 2 = greater than 28 and less than or equal to 55, and 3 = greater than 55.
Low Female Employ Rate
The association between urban rate and breast cancer rate for countries with a low female employment rate was calculated. Here the Pearson coefficient is -0.052 with a p-value of 0.842. The r of -0.052 shows a very weak negative relationship, but p is greater than .05, so it is not significant; this could have happened by chance.
association between urbanrate and breastcancerrate for low female employ rate countries
(-0.05216618996434125, 0.8423894883651072)
Here is the graph showing association between urban rate and breast cancer rate for countries with low female employ rate. In countries with low female employ rate it looks like breast cancer rate does not have any relation to urban rate.
Medium Female Employ Rate
The category is medium when the female employment rate is between 28 and 55. The association between urban rate and breast cancer rate for countries with a medium female employment rate was calculated. Here the Pearson coefficient is 0.54 with a p-value of 3.57E-09. The r of 0.54 shows a moderate positive relationship, and the low p-value indicates the association is significant and unlikely to have happened by chance alone.
association between urbanrate and breastcancerrate for medium female employ rate countrie
(0.5367838916713512, 3.5759994882098993e-09)
Here is the graph showing association between urban rate and breast cancer rate for countries with medium female employ rate. In countries with medium female employ rate it looks like breast cancer rate increases with the urban rate.
High Female Employ Rate
The category is high when the female employment rate is greater than 55. The association between urban rate and breast cancer rate for countries with a high female employment rate was calculated. Here the Pearson coefficient is 0.77 with a p-value of 7.88E-10. The r of 0.77 shows a strong positive relationship, and the low p-value indicates the association is significant and unlikely to have happened by chance alone.
association between urbanrate and breastcancerrate for high female employ rate countrie
(0.7672473708543266, 7.877009591193359e-10)
Here is the graph showing association between urban rate and breast cancer rate for countries with High female employ rate. In countries with high female employ rate it looks like breast cancer rate increases with the urban rate.
Conclusion: The overall graph shows that there is a positive relation between breast cancer rate and urban rate. The countries which are more urban have higher breast cancer rate. But this association is greatest in countries with high female employed rate and moderate in countries with medium female employed rate.
Data Analysis Tools - Assignment 3
The Pearson correlation coefficient is calculated when both the explanatory and response variables are quantitative. Here I will be using the gapminder data and comparing incomeperperson to alcconsumption.
Python Code:
# -*- coding: utf-8 -*-
"""
Created on Tue Feb 2 14:38:01 2021
@author: GB8PM0
"""
#%%
import pandas
import numpy
import seaborn
import scipy
import matplotlib.pyplot as plt
data = pandas.read_csv('gapminder.csv', low_memory=False)
#setting variables you will be working with to numeric
data['alcconsumption'] = pandas.to_numeric(data['alcconsumption'], errors='coerce')
data['lifeexpectancy'] = pandas.to_numeric(data['lifeexpectancy'], errors='coerce')
data['incomeperperson'] = pandas.to_numeric(data['incomeperperson'], errors='coerce')
data['incomeperperson']=data['incomeperperson'].replace(' ', numpy.nan)
scat1 = seaborn.regplot(x="incomeperperson", y="alcconsumption", fit_reg=True, data=data)
plt.xlabel('incomeperperson')
plt.ylabel('alcconsumption')
plt.title('Scatterplot for the Association Between incomeperperson Rate and alcconsumption Rate')
data_clean=data.dropna()
print ('association between incomeperperson and alcohol consumption rate')
print (scipy.stats.pearsonr(data_clean['incomeperperson'], data_clean['alcconsumption']))
Results
runcell(1, 'C:/Training/Data Analysis/Assignment3.py')
association between incomeperperson and alcohol consumption rate
(0.2879846191669146, 0.00013372003550947488)
Explanation
Here the r = 0.288 and p = 0.00013. Pearson Coefficient of r = 0.288 tells us that there is a positive relationship between incomeperperson and alcohol consumption. As the incomeperperson increases, alcohol consumption also increases. As r=0.288 is much closer to 0 than to 1, we can say that the relation is weak.
A value of p =.000133 is less than .05, so this says that it is highly unlikely that the relationship would be due to chance alone.
The r-square is 0.0829. This can be interpreted as follows: if we know incomeperperson, we can predict about 8.29% of the variability in alcohol consumption. Of course, that also means that about 92% of the variability is unaccounted for.
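Since r-square is just the square of the Pearson r reported above, it can be checked directly:
r = 0.2879846191669146   # Pearson r from the output above
print(round(r ** 2, 4))  # 0.0829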
Data Analysis Tools - Assignment 2
I am using the AddHealth data and checking whether there is a relationship between ethnicity and self-esteem. The explanatory categorical variable is H1GI8 (ethnicity) and the response variable is H1PF33 (self-esteem). Data management strategies were used to create two levels: adolescents with responses of 1 and 2 were coded as 'good selfesteem' and those with 4 and 5 as 'bad selfesteem'.
Null hypothesis: there is no relationship between ethnicity and self-esteem.
Alternate hypothesis: there is a relationship between ethnicity and self-esteem.
Python Code:
# -*- coding: utf-8 -*-
"""
Created on Fri Jan 29 15:47:38 2021
@author: GB8PM0
"""
# import pandas and numpy
import pandas
import numpy
import seaborn
import matplotlib.pyplot as plt
import scipy.stats
# any additional libraries would be imported here
#Set PANDAS to show all columns in DataFrame
pandas.set_option('display.max_columns', None)
#Set PANDAS to show all rows in DataFrame
pandas.set_option('display.max_rows', None)
# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x:'%f'%x)
#define data set to be used
mydata = pandas.read_csv('addhealth_pds.csv', low_memory=False)
#data management - selfesteem responses of 1 and 2 coded as good selfesteem, 4 and 5 as not good selfesteem
def selfesteem(row):
    if row['H1PF33'] == 1:
        return 1
    elif row['H1PF33'] == 2 :
        return 1
    elif row['H1PF33'] == 4 :
        return 0
    elif row['H1PF33'] == 5 :
        return 0
mydata['selfesteem'] = mydata.apply (lambda row: selfesteem (row),axis=1)
# Count of records in each option selected for selfesteem
print("% of selfesteem")
pse1 = mydata["selfesteem"].value_counts(sort=True, normalize= False)
print(pse1)
mydata['H1GI8'] = pandas.to_numeric(mydata['H1GI8'])
#cleanup of ETHNICITY so we get only 1, 2,3, 4 #Set missing data to NAN
mydata['H1GI8']= mydata['H1GI8'].replace(5, numpy.nan)
mydata['H1GI8']= mydata['H1GI8'].replace(6, numpy.nan)
mydata['H1GI8']= mydata['H1GI8'].replace(7, numpy.nan)
mydata['H1GI8']= mydata['H1GI8'].replace(8, numpy.nan)
mydata['H1GI8']= mydata['H1GI8'].replace(9, numpy.nan)
#recoding values for H1GI8 into a new variable, ethnicity
recode1 = {1: "American", 2: "Black", 3: "AmericanIndian", 4: "Asian"}
mydata['ethnicity']= mydata['H1GI8'].map(recode1)
# contingency table of observed counts
ct1=pandas.crosstab(mydata['selfesteem'], mydata['ethnicity'])
print (ct1)
# column percentages
colsum=ct1.sum(axis=0)
colpct=ct1/colsum
print(colpct)
# chi-square
print ('chi-square value, p value, expected counts')
cs1= scipy.stats.chi2_contingency(ct1)
print (cs1)
# graph percent with self esteem within each ethnicity group
# factorplot was renamed to catplot in newer seaborn releases
seaborn.catplot(x="ethnicity", y="selfesteem", data=mydata, kind="bar", ci=None)
plt.xlabel('ethnicity')
plt.ylabel('Self esteem')
# make self esteem categorical
mydata['selfesteem'] = mydata['selfesteem'].astype('category')
# compare only two ethnicities
recode2 = {1: "American", 2: "Black"}
mydata['COMP1v2']= mydata['H1GI8'].map(recode2)
print( "comparing American and Black")
print( "---------------------------------------")
# contingency table of observed counts
ct2=pandas.crosstab(mydata['selfesteem'], mydata['COMP1v2'])
print (ct2)
# column percentages for two ethnicities
colsum=ct2.sum(axis=0)
colpct=ct2/colsum
print(colpct)
print ('chi-square value, p value, expected counts')
cs2= scipy.stats.chi2_contingency(ct2)
print (cs2)
# compare next two ethnicities
recode3 = {1: "American", 3: "AmericanIndian"}
mydata['COMP1v3']= mydata['H1GI8'].map(recode3)
print( "comparing American and American Indian")
print( "---------------------------------------")
# contingency table of observed counts
ct3=pandas.crosstab(mydata['selfesteem'], mydata['COMP1v3'])
print (ct3)
# column percentages
colsum=ct3.sum(axis=0)
colpct=ct3/colsum
print(colpct)
print ('chi-square value, p value, expected counts')
cs3= scipy.stats.chi2_contingency(ct3)
print (cs3)
# compare next two ethnicities
print( "comparing American and Asian")
print( "---------------------------------------")
recode4 = {1: "American", 4: "Asian"}
mydata['COMP1v4']= mydata['H1GI8'].map(recode4)
# contingency table of observed counts
ct4=pandas.crosstab(mydata['selfesteem'], mydata['COMP1v4'])
print (ct4)
# column percentages
colsum=ct4.sum(axis=0)
colpct=ct4/colsum
print(colpct)
print ('chi-square value, p value, expected counts')
cs4= scipy.stats.chi2_contingency(ct4)
print (cs4)
# compare next two ethnicities
recode5 = {2: "Black", 3: "AmericanIndian"}
mydata['COMP2v3']= mydata['H1GI8'].map(recode5)
print( "comparing Black and American Indian")
print( "---------------------------------------")
# contingency table of observed counts
ct5=pandas.crosstab(mydata['selfesteem'], mydata['COMP2v3'])
print (ct5)
# column percentages
colsum=ct5.sum(axis=0)
colpct=ct5/colsum
print(colpct)
print ('chi-square value, p value, expected counts')
cs5= scipy.stats.chi2_contingency(ct5)
print (cs5)
# compare next two ethnicities
recode6 = {2: "Black", 4: "Asian"}
mydata['COMP2v4']= mydata['H1GI8'].map(recode6)
print( "comparing Black and Asian")
print( "---------------------------------------")
# contingency table of observed counts
ct6=pandas.crosstab(mydata['selfesteem'], mydata['COMP2v4'])
print (ct6)
# column percentages
colsum=ct6.sum(axis=0)
colpct=ct6/colsum
print(colpct)
print ('chi-square value, p value, expected counts')
cs6= scipy.stats.chi2_contingency(ct6)
print (cs6)
# compare next two ethnicities
recode7 = {3: "AmericanIndian", 4: "Asian"}
mydata['COMP3v4']= mydata['H1GI8'].map(recode7)
print( "comparing AmericanIndian and Asian")
print( "---------------------------------------")
# contingency table of observed counts
ct7=pandas.crosstab(mydata['selfesteem'], mydata['COMP3v4'])
print (ct7)
# column percentages
colsum=ct7.sum(axis=0)
colpct=ct7/colsum
print(colpct)
print ('chi-square value, p value, expected counts')
cs7= scipy.stats.chi2_contingency(ct7)
print (cs7)
Results
ethnicity American AmericanIndian Asian Black
selfesteem
0.000000 18 1 3 11
1.000000 97 28 17 72
ethnicity American AmericanIndian Asian Black
selfesteem
0.000000 0.156522 0.034483 0.150000 0.132530
1.000000 0.843478 0.965517 0.850000 0.867470
From the column percentages (and the bar graph produced above), American Indian adolescents show a higher proportion of good self-esteem than the other ethnicities.
chi-square value, p value, expected counts
(3.0305672660644833, 0.38693627395786484, 3, array([[15.36437247, 3.87449393, 2.67206478, 11.08906883],
[99.63562753, 25.12550607, 17.32793522, 71.91093117]]))
The chi-square statistic is 3.03 with p = .39. Assuming the null hypothesis is true (ethnicity and self-esteem are independent), a chi-square of 3.03 on 3 degrees of freedom is small, and a p-value of .39 means we cannot claim a relationship between ethnicity and self-esteem. We fail to reject the null hypothesis.
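As a side note, the expected counts reported in the output follow from the independence assumption: each cell's expected count is its row total times its column total divided by the grand total. A minimal sketch, reusing the ct1 contingency table from the script above:
import numpy
# Expected counts under independence and the resulting chi-square statistic
observed = ct1.values                               # 2 x 4 table of observed counts
row_totals = observed.sum(axis=1, keepdims=True)    # totals per self-esteem level
col_totals = observed.sum(axis=0, keepdims=True)    # totals per ethnicity
expected = row_totals * col_totals / observed.sum()
chi_square = ((observed - expected) ** 2 / expected).sum()
print(chi_square)                                   # about 3.03, matching the output above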
Now we will compare each pair of ethnicities to determine whether any pair differs in self-esteem. Since there are 4 categories, 6 pairwise comparisons are needed, so a Bonferroni adjustment applies: a comparison is significant only if its p-value is below .05/6, which is about .0083. The chi-square statistics and unadjusted p-values for all 6 comparisons are summarized below (a short sketch of the adjusted-threshold calculation follows the chart):
American versus Black: chi-square = 0.072, p = 0.789
American versus AmericanIndian: chi-square = 2.040, p = 0.153
American versus Asian: chi-square = 0.068, p = 0.795
Black versus AmericanIndian: chi-square = 1.256, p = 0.262
Black versus Asian: chi-square = 0.025, p = 0.874
AmericanIndian versus Asian: chi-square = 0.848, p = 0.357
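The adjusted threshold and the six pairwise tests can also be produced in a short loop over the contingency tables ct2 through ct7 built in the script above; a minimal sketch:
import scipy.stats
# Bonferroni-adjusted threshold for 6 post hoc comparisons
adjusted_alpha = 0.05 / 6    # about 0.0083
pairs = [('American vs Black', ct2), ('American vs AmericanIndian', ct3),
         ('American vs Asian', ct4), ('Black vs AmericanIndian', ct5),
         ('Black vs Asian', ct6), ('AmericanIndian vs Asian', ct7)]
for label, table in pairs:
    chi2, p, dof, expected = scipy.stats.chi2_contingency(table)
    print(label, 'chi-square=%.3f p=%.3f significant=%s' % (chi2, p, p < adjusted_alpha))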
1st comparison
comparing American and Black
---------------------------------------
COMP1v2 American Black
selfesteem
0.0 18 11
1.0 97 72
COMP1v2 American Black
selfesteem
0.0 0.156522 0.132530
1.0 0.843478 0.867470
chi-square value, p value, expected counts
(0.07153049981033592, 0.7891212711774603, 1, array([[16.84343434, 12.15656566],
[98.15656566, 70.84343434]]))
Here p = .789, which is well above the Bonferroni-adjusted threshold of .0083, so the comparison is not significant. There is no evidence of a difference in self-esteem between the American and Black groups; we cannot reject the null hypothesis.
2nd Comparison
comparing American and American Indian
---------------------------------------
COMP1v3 American AmericanIndian
selfesteem
0.0 18 1
1.0 97 28
COMP1v3 American AmericanIndian
selfesteem
0.0 0.156522 0.034483
1.0 0.843478 0.965517
chi-square value, p value, expected counts
(2.0402935374418054, 0.15318008443661052, 1, array([[15.17361111, 3.82638889],
[99.82638889, 25.17361111]]))
Here p = .153, which is above the adjusted threshold of .0083, so the comparison is not significant. There is no evidence of a difference in self-esteem between the American and American Indian groups; we cannot reject the null hypothesis.
3rd comparison
comparing American and Asian
---------------------------------------
COMP1v4 American Asian
selfesteem
0.0 18 3
1.0 97 17
COMP1v4 American Asian
selfesteem
0.0 0.156522 0.150000
1.0 0.843478 0.850000
chi-square value, p value, expected counts
(0.06757723112128144, 0.7948975519327164, 1, array([[17.88888889, 3.11111111],
[97.11111111, 16.88888889]]))
Here p = .795, which is above the adjusted threshold of .0083, so the comparison is not significant. There is no evidence of a difference in self-esteem between the American and Asian groups; we cannot reject the null hypothesis.
4th comparison
comparing Black and American Indian
---------------------------------------
COMP2v3 AmericanIndian Black
selfesteem
0.0 1 11
1.0 28 72
COMP2v3 AmericanIndian Black
selfesteem
0.0 0.034483 0.132530
1.0 0.965517 0.867470
chi-square value, p value, expected counts
(1.2563356875778986, 0.26234583628669483, 1, array([[ 3.10714286, 8.89285714],
[25.89285714, 74.10714286]]))
Here p = .262, which is above the adjusted threshold of .0083, so the comparison is not significant. There is no evidence of a difference in self-esteem between the American Indian and Black groups; we cannot reject the null hypothesis.
5th comparison
comparing Black and Asian
---------------------------------------
COMP2v4 Asian Black
selfesteem
0.0 3 11
1.0 17 72
COMP2v4 Asian Black
selfesteem
0.0 0.150000 0.132530
1.0 0.850000 0.867470
chi-square value, p value, expected counts
(0.025210190682473096, 0.8738444339085241, 1, array([[ 2.7184466, 11.2815534],
[17.2815534, 71.7184466]]))
Here p = .874, which is above the adjusted threshold of .0083, so the comparison is not significant. There is no evidence of a difference in self-esteem between the Asian and Black groups; we cannot reject the null hypothesis.
6th comparison
comparing AmericanIndian and Asian
---------------------------------------
COMP3v4 AmericanIndian Asian
selfesteem
0.0 1 3
1.0 28 17
COMP3v4 AmericanIndian Asian
selfesteem
0.0 0.034483 0.150000
1.0 0.965517 0.850000
chi-square value, p value, expected counts
(0.8477610153256708, 0.35718650823889897, 1, array([[ 2.36734694, 1.63265306],
[26.63265306, 18.36734694]]))
Here p = .357, which is above the adjusted threshold of .0083, so the comparison is not significant. There is no evidence of a difference in self-esteem between the American Indian and Asian groups; we cannot reject the null hypothesis.
Data Analysis Tools - Assignment 1
I am using the AddHealth data to check whether there is a relationship between ethnicity and the age at which a first romantic relationship started. The explanatory categorical variable is H1GI8 (ethnicity) and the quantitative response variable is H1RI3_1.
Null hypothesis: there is no relationship between ethnicity and the age a romantic relationship started.
Alternate hypothesis: there is a relationship between ethnicity and the age a romantic relationship started.
Python Code:
# -*- coding: utf-8 -*-
"""
Created on Wed Dec 23 15:49:41 2020
Assignment 4
@author: GB8PM0
"""
# import pandas and numpy
import numpy
import pandas
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x:'%f'%x)
mydata = pandas.read_csv('addhealth_pds.csv', low_memory=False)
#making individual ethnicity variables numeric
mydata['H1GI8'] = pandas.to_numeric(mydata['H1GI8'])
#cleanup of ETHNICITY so we get only 1, 2,3, 4 #Set missing data to NAN
mydata['H1GI8']= mydata['H1GI8'].replace(5, numpy.nan)
mydata['H1GI8']= mydata['H1GI8'].replace(6, numpy.nan)
mydata['H1GI8']= mydata['H1GI8'].replace(7, numpy.nan)
mydata['H1GI8']= mydata['H1GI8'].replace(8, numpy.nan)
mydata['H1GI8']= mydata['H1GI8'].replace(9, numpy.nan)
##Set missing data to NAN for age of romantic relation start
mydata = mydata.replace(r'^\s*$', numpy.NaN, regex=True)
mydata['H1RI3_1'].fillna("95", inplace=True)
mydata['H1RI3_1'] = mydata['H1RI3_1'].replace("95", numpy.nan)
mydata['H1RI3_1'] = mydata['H1RI3_1'].replace("96", numpy.nan)
mydata['H1RI3_1'] = mydata['H1RI3_1'].replace("97", numpy.nan)
mydata['H1RI3_1'] = mydata['H1RI3_1'].replace("98", numpy.nan)
#quantitative response variable
mydata['H1RI3_1'] = pandas.to_numeric(mydata['H1RI3_1'])
# using ols function for calculating the F-statistic and associated p value
model1 = smf.ols(formula='H1RI3_1 ~ C(H1GI8)', data = mydata)
results1 = model1.fit()
print (results1.summary())
sub2 = mydata[['H1RI3_1', 'H1GI8']].dropna()
# group means by ethnicity
print ('means for age started romantic relation by ethnicity')
m1= sub2.groupby('H1GI8').mean()
print (m1)
#standard deviation by ethnicity
print ('standard deviations for age started romantic relation by ethnicity')
sd1 = sub2.groupby('H1GI8').std()
print (sd1)
# compare two ethnicities at a time
mc1 = multi.MultiComparison(sub2['H1RI3_1'], sub2['H1GI8'])
res1 = mc1.tukeyhsd()
print(res1.summary())
Results
OLS Regression Results
==============================================================================
Dep. Variable: H1RI3_1 R-squared: 0.009
Model: OLS Adj. R-squared: -0.006
Method: Least Squares F-statistic: 0.5893
Date: Wed, 27 Jan 2021 Prob (F-statistic): 0.623
Time: 13:35:27 Log-Likelihood: -446.89
No. Observations: 192 AIC: 901.8
Df Residuals: 188 BIC: 914.8
Df Model: 3
Covariance Type: nonrobust
===================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------
Intercept 15.6224 0.253 61.686 0.000 15.123 16.122
C(H1GI8)[T.2.0] 0.1064 0.413 0.257 0.797 -0.709 0.921
C(H1GI8)[T.3.0] -0.7177 0.603 -1.190 0.235 -1.907 0.472
C(H1GI8)[T.4.0] 0.0204 0.716 0.028 0.977 -1.393 1.433
==============================================================================
Omnibus: 69.001 Durbin-Watson: 1.840
Prob(Omnibus): 0.000 Jarque-Bera (JB): 295.473
Skew: 1.334 Prob(JB): 6.90e-65
Kurtosis: 8.460 Cond. No. 4.45
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
means for age started romantic relation by ethnicity
H1RI3_1
H1GI8
1.000000 15.622449
2.000000 15.728814
3.000000 14.904762
4.000000 15.642857
standard deviations for age started romantic relation by ethnicity
H1RI3_1
H1GI8
1.000000 2.775130
2.000000 2.483258
3.000000 1.179185
4.000000 1.945691
Multiple Comparison of Means - Tukey HSD, FWER=0.05
===================================================
group1 group2 meandiff p-adj lower upper reject
---------------------------------------------------
1.0 2.0 0.1064 0.9 -0.9646 1.1773 False
1.0 3.0 -0.7177 0.6179 -2.2805 0.8451 False
1.0 4.0 0.0204 0.9 -1.8365 1.8773 False
2.0 3.0 -0.8241 0.5603 -2.4755 0.8274 False
2.0 4.0 -0.086 0.9 -2.0181 1.8462 False
3.0 4.0 0.7381 0.8066 -1.5044 2.9805 False
Explanation:
The mean age at which a romantic relationship started, by ethnicity:
1.0 White: 15.62
2.0 Black or African American: 15.73
3.0 American Indian: 14.90
4.0 Asian: 15.64
Based on these data, American Indian adolescents started romantic relationships the earliest (mean age 14.9) and Black or African American adolescents the latest (mean age 15.7).
F-statistic: 0.5893
Prob (F-statistic): 0.623
Here the p-value of 0.623 is greater than .05, so we fail to reject the null hypothesis: there is no evidence of a relationship between ethnicity and the age a romantic relationship started.
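As a side note, the F-statistic and its p-value can be read directly off the fitted model rather than from the printed summary; a minimal sketch, reusing results1 from the script above:
# Overall F-test of the model: are the group means equal?
print('F =', round(results1.fvalue, 4))      # about 0.589
print('p =', round(results1.f_pvalue, 3))    # about 0.623, which is greater than 0.05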
Data Management And Visualization - Assignment 4
Python Code for Assignment 4
# -*- coding: utf-8 -*-
"""
Created on Wed Dec 23 15:49:41 2020
Assignment 4
@author: GB8PM0
"""
#%%
# import pandas and numpy
import pandas
import numpy
import seaborn
import matplotlib.pyplot as plt
# any additional libraries would be imported here
#Set PANDAS to show all columns in DataFrame
pandas.set_option('display.max_columns', None)
#Set PANDAS to show all rows in DataFrame
pandas.set_option('display.max_rows', None)
# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x:'%f'%x)
#define data set to be used
mydata = pandas.read_csv('addhealth_pds.csv', low_memory=False)
#data management - create happiness types: 1 and 2 don't get sad, 4 and 5 get sad
def happiness(row):
    if row['H1PF10'] == 1:
        return 1
    elif row['H1PF10'] == 2:
        return 1
    elif row['H1PF10'] == 4:
        return 0
    elif row['H1PF10'] == 5:
        return 0
mydata['happiness'] = mydata.apply(lambda row: happiness(row), axis=1)
# Count of records in each option selected for happiness
print("ph1 - % of happiness")
ph1 = mydata["happiness"].value_counts(sort=True, normalize=True) * 100
print(ph1)
# plot univariate graph of happiness
seaborn.countplot(x="happiness", data=mydata)
plt.xlabel('happiness')
plt.title('happiness level')
#%%
#data management - creating satisfaction types: 1 and 2 satisfied, 4 and 5 not satisfied, 3 neither
def satisfaction(row):
    if row['H1PF5'] == 1:
        return 1
    elif row['H1PF5'] == 2:
        return 1
    elif row['H1PF5'] == 4:
        return 2
    elif row['H1PF5'] == 5:
        return 2
mydata['satisfaction'] = mydata.apply(lambda row: satisfaction(row), axis=1)
# rename categorical variable values for graphing
# first change the variable format to categorical
mydata['satisfaction'] = mydata['satisfaction'].astype('category')
# second attach the new value labels
mydata['satisfaction'] = mydata['satisfaction'].cat.rename_categories(["Satisfied", "Not Satisfied"])
# % of records in each option selected for mother satisfaction
print("ps1 - % of satisfaction - Satisfaction with Mother")
ps1 = mydata["satisfaction"].value_counts(sort=True, normalize=True) * 100
print(ps1)
# plot univariate graph of satisfaction
seaborn.countplot(x="satisfaction", data=mydata)
plt.xlabel('Mother satisfaction')
plt.title('Relationship with Mother satisfaction')
#%%
#data management - creating selfesteem types: 1 and 2 good selfesteem, 4 and 5 not good selfesteem
def selfesteem(row):
    if row['H1PF33'] == 1:
        return 1
    elif row['H1PF33'] == 2:
        return 1
    elif row['H1PF33'] == 4:
        return 0
    elif row['H1PF33'] == 5:
        return 0
mydata['selfesteem'] = mydata.apply(lambda row: selfesteem(row), axis=1)
# Count of records in each option selected for selfesteem
print("% of selfesteem")
pse1 = mydata["selfesteem"].value_counts(sort=True, normalize=True)
print(pse1)
# plot univariate graph of selfesteem
seaborn.countplot(x="selfesteem", data=mydata)
plt.xlabel('selfesteem')
plt.title('selfesteem level')
#plot bivariate bar graph C->C satisfaction and happiness
seaborn.catplot(x='satisfaction', y='happiness', data=mydata, kind="bar", ci=None)
plt.xlabel('Relationship With Mother satisfaction')
plt.ylabel('happiness')
Variable 1 – happiness (H1PF10)
Univariate graph of happiness. Created two groupings of "0" and "1": 0 is for adolescents who felt sad and 1 is for adolescents who did not feel sad. 4435 adolescents felt sadness, and 932 were happy (not sad).
Variable 2 – Satisfaction with Relationship with Mother
Created a univariate graph of satisfaction with the mother relationship, using two groupings: "Satisfied" and "Not Satisfied". Around 5404 adolescents were satisfied with the relationship with their mother, and around 363 were not satisfied.
Variable 3 – Self Esteem
Created two groupings of "0" and "1": 0 is for adolescents who do not like themselves (low self-esteem) and 1 is for adolescents who like themselves (high self-esteem). 5022 adolescents had high self-esteem, and 592 had low self-esteem.
Relationship with Mother versus Happiness
Created a bivariate graph of satisfaction with the mother relationship against happiness. The graph shows that about 17.8% of adolescents who are satisfied with the relationship are happy, whereas only about 5% of adolescents who are not satisfied are happy.
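The percentages quoted here correspond to the mean of the binary happiness variable within each satisfaction group, which is what the bivariate bar chart displays; a minimal sketch, reusing mydata from the script above:
# Percent happy within each level of satisfaction with the mother relationship
print(mydata.groupby('satisfaction')['happiness'].mean() * 100)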
Data Management And Visualization - Assignment 3
Python Code for Assignment 3
# -*- coding: utf-8 -*-
"""
Created on Wed Dec 23 15:49:41 2020
@author: GB8PM0
"""
# import pandas and numpy
import pandas
import numpy
#Set PANDAS to show all columns in DataFrame
pandas.set_option('display.max_columns', None)
#Set PANDAS to show all rows in DataFrame
pandas.set_option('display.max_rows', None)
# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x:'%f'%x)
#define data set to be used
mydata = pandas.read_csv('addhealth_pds.csv', low_memory=False)
#creating satisfaction types of 1 and2 satisfied, 4 &5 Not Satisfied 3 neither
def satisfaction(row):
    if row['H1PF5'] == 1:
        return 1
    elif row['H1PF5'] == 2:
        return 1
    elif row['H1PF5'] == 4:
        return 2
    elif row['H1PF5'] == 5:
        return 2
mydata['satisfaction'] = mydata.apply (lambda row: satisfaction (row),axis=1)
# you can rename categorical variable values for graphing if original values are not informative
# first change the variable format to categorical if you haven’t already done so
mydata['satisfaction'] = mydata['satisfaction'] .astype('category')
# second create a new variable that has the new variable value labels
mydata['satisfaction'] =mydata['satisfaction'].cat.rename_categories(["Satisfied", "Not Satisfied"])
# countof records in each option selected for mother satifaction
print(" s1 - count of satisfaction - Satisfaction with Mother")
s1 = mydata["satisfaction"].value_counts(sort=True)
print(s1)
# % of records in each option selected for mother satifaction
print("ps1 - % of satisfaction - Satisfaction with Mother")
ps1 = mydata.groupby("satisfaction").size() * 100 / len(mydata)
print(ps1)
#creating happiness types of 1 and2 don't get sad, 4 &5 get sad
def happiness(row):
    if row['H1PF10'] == 1:
        return 1
    elif row['H1PF10'] == 2:
        return 1
    elif row['H1PF10'] == 4:
        return 0
    elif row['H1PF10'] == 5:
        return 0
mydata['happiness'] = mydata.apply (lambda row: happiness (row),axis=1)
# Count of records in each option selected for happiness
print("h1 - Count of happiness - happiness")
h1 = mydata["happiness"].value_counts(sort=True)
print(h1)
# Count of records in each option selected for happiness
print("ph1 - % of H1PF10 - happiness")
ph1 = mydata["happiness"].value_counts(sort=True, normalize= True) * 100
print(ph1)
#creating selfesteem types of 1 and2 good selfesteem, 4 &5 not good selfesteem
def selfesteem(row):
    if row['H1PF33'] == 1:
        return 1
    elif row['H1PF33'] == 2:
        return 1
    elif row['H1PF33'] == 4:
        return 0
    elif row['H1PF33'] == 5:
        return 0
mydata['selfesteem'] = mydata.apply (lambda row: selfesteem (row),axis=1)
# Count of records in each option selected for selfesteem
print("se1 - Count of selfesteem - selfesteem")
se1 = mydata["selfesteem"].value_counts(sort=True)
print(se1)
# Count of records in each option selected for selfesteem
print("pse1 - % of selfesteem")
pse1 = mydata["selfesteem"].value_counts(sort=True, normalize= True) * 100
print(pse1)
Variable 1 – H1PF5
Created two groupings, "Satisfied" and "Not Satisfied". About 83.09% were satisfied with the relationship with their mother, and about 5.58% were not satisfied.
s1 - count of satisfaction - Satisfaction with Mother
Satisfied 5404
Not Satisfied 363
Name: satisfaction, dtype: int64
ps1 - % of satisfaction - Satisfaction with Mother
Satisfied 83.087331
Not Satisfied 5.581181
dtype: float64
Variable 2 – H1PF10
Created two groupings of “0” and “1”. 0 is for adolescents who felt sad and 1 is for adolescents who were not sad. 82.63% of adolescents felt sadness, and 17.36% of adolescents were happy (not sad)
h1 - Count of happiness - happiness
0.000000 4435
1.000000 932
Name: happiness, dtype: int64
ph1 - % of H1PF10 - happiness
0.000000 82.634619
1.000000 17.365381
Name: happiness, dtype: float64
Variable 3 – H1PF33
Created two groupings of "0" and "1": 0 is for adolescents who do not like themselves (low self-esteem), and 1 is for adolescents who like themselves (high self-esteem). About 89.45% of adolescents had high self-esteem, and about 10.55% had low self-esteem.
se1 - Count of selfesteem - selfesteem
1.000000 5022
0.000000 592
Name: selfesteem, dtype: int64
pse1 - % of selfesteem
1.000000 89.454934
0.000000 10.545066
Name: selfesteem, dtype: float64
Data Management and Visualization - Assignment 2
Python Code for Assignment 2
"""
Created on Wed Dec 23 15:49:41 2020
@author: GB8PM0
"""
# import pandas and numpy
import pandas
import numpy
#define data set to be used
mydata = pandas.read_csv('addhealth_pds.csv', low_memory=False)
print("number of rows and column")
#number of rows
print(len(mydata))
# number of columns
print(len(mydata.columns))
# Count of records in each option selected for mother satifaction
print("Count of H1PF5 - Satisfaction with Mother")
c1 = mydata["H1PF5"].value_counts(sort=True, dropna=False)
print(c1)
# % of records in each option selected for mother satifaction
print("% of H1PF5 - Satisfaction with Mother")
p1 = mydata.groupby("H1PF5",dropna=False).size() * 100 / len(mydata)
print(p1)
# Count of records in each option selected for feeling sadness
print("Count of H1PF10 -Feeling Sadness")
c2 = mydata["H1PF10"].value_counts(sort=True, dropna=False)
print(c2)
# % of records in each option selected for feeling sadness
print("% of H1PF10 -Feeling Sadness")
p2 = mydata.groupby("H1PF10", dropna=False).size() * 100 / len(mydata)
print(p2)
# Count of records in each option selected for self-esteem
print("Count of H1PF33 - self-esteem")
c3 = mydata["H1PF33"].value_counts(sort=True, dropna=False)
print(c3)
# % of records in each option selected for self-esteem
print("% of H1PF33 - self-esteem")
p3 = mydata.groupby("H1PF33", dropna=False).size() * 100 / len(mydata)
print(p3)
Results Of Python Assignment 2
Variable 1 – H1PF5
A random sample of 6,504 adolescents were asked the following question: "Overall, you are satisfied with your relationship with your mother?" About 46.77% chose category 1 ("Strongly agree") and about 36.31% fell into category 2 ("Agree"). About 5.69% were a legitimate skip (no resident mother).
Number of rows: 6504
Number of columns: 2829
Count of H1PF5 - Satisfaction with Mother
1 3042
2 2362
7 370
3 354
4 266
5 97
8 8
6 4
9 1
Name: H1PF5, dtype: int64
% of H1PF5 - Satisfaction with Mother
H1PF5
1 46.771218
2 36.316113
3 5.442804
4 4.089791
5 1.491390
6 0.061501
7 5.688807
8 0.123001
9 0.015375
dtype: float64
Variable 2 – H1PF10
For the next question, the same students were asked: "You never get sad." About 54.78% answered 4 (Disagree) and about 17.16% answered 3 (Neither agree nor disagree).
Count of H1PF10 -Feeling Sadness
4 3563
3 1116
5 872
2 712
1 220
8 14
6 7
Name: H1PF10, dtype: int64
% of H1PF10 -Feeling Sadness
H1PF10
1 3.382534
2 10.947109
3 17.158672
4 54.781673
5 13.407134
6 0.107626
8 0.215252
dtype: float64
Variable 3 – H1PF33
For the next question, the same students were asked: "You like yourself just the way you are." About 42.65% answered 2 (Agree) and about 34.56% answered 1 (Strongly agree).
Count of H1PF33 - self-esteem
2 2774
1 2248
3 868
4 534
5 58
8 12
6 10
Name: H1PF33, dtype: int64
% of H1PF33 - self-esteem
H1PF33
1 34.563346
2 42.650677
3 13.345633
4 8.210332
5 0.891759
6 0.153752
8 0.184502
dtype: float64
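As a side note, the counts and percentages printed separately above can be combined into one frequency table; a minimal sketch, reusing mydata from the script above:
import pandas
# Counts and percentages for H1PF5 side by side
freq_h1pf5 = pandas.DataFrame({
    'count': mydata['H1PF5'].value_counts(dropna=False),
    'percent': mydata['H1PF5'].value_counts(dropna=False, normalize=True) * 100
})
print(freq_h1pf5)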
Data Management and Visualization - Assignment 1
Step 1-3
STEP 1. Choose a data set that you would like to work with.
STEP 2. Identify a specific topic of interest
STEP 3. Prepare a codebook of your own (i.e., print individual pages or copy screen and paste into a new document) from the larger codebook that includes the questions/items/variables that measure your selected topics.)
I am a mother with two adolescents, and I am interested in understanding how the relationship with the mother affects adolescents, so that I can be a better mother. I would like to explore the AddHealth data.
Many adolescents go through depression and loneliness, which can then lead to risky behaviors such as drug use, alcoholism, and teenage pregnancy. I would like to understand the mother's relationship and its effect on adolescent sadness. I will be looking at "Section 18: Personality and Family". In it I will examine adolescent sadness together with the adolescent's satisfaction with the relationship with the mother. The variables considered for analysis are H1PF5 and H1PF10. Here is my codebook.
5. Overall, you are satisfied with your relationship with your mother H1PF5 num 1
3042 1 Strongly agree
2362 2 agree
354 3 Neither agree nor disagree
266 4 disagree
97 5 Strongly disagree
4 6 refused
370 7 Legitimate skip [no resident MOM]
8 8 don’t know
1 9 Not applicable
10. You never get sad. H1PF10 num 1
220 1 Strongly agree
712 2 agree
1116 3 Neither agree nor disagree
3563 4 disagree
872 5 Strongly disagree
7 6 refused
14 8 don’t know
Step 4-5
STEP 4. Identify a second topic that you would like to explore in terms of its association with your original topic.
STEP 5. Add questions/items/variables documenting this second topic to your personal codebook.
A number of adolescents do not feel good about themselves. The second topic I would like to explore is the relationship between self-esteem and sadness. I will look at variable H1PF33.
33. You like yourself just the way you are. H1PF33 num 1
2248 1 strongly agree
2774 2 agree
868 3 neither agree nor disagree
534 4 disagree
58 5 strongly disagree
10 6 refused
12 8 don’t know
Step 6
STEP 6. Perform a literature review to see what research has been previously done on this topic. Use sites such as Google Scholar (http://scholar.google.com) to search for published academic work in the area(s) of interest. Try to find multiple sources, and take note of basic bibliographic information.
Looked at
https://journals.sagepub.com/doi/abs/10.1177/0192513X04270262
https://link.springer.com/article/10.1023/B:JOYO.0000025322.11510.9d
https://link.springer.com/article/10.1007/s10964-005-9009-2
Step 7
STEP 7. Based on your literature review, develop a hypothesis about what you believe the association might be between these topics. Be sure to integrate the specific variables you selected into the hypothesis.
A positive relationship with the mother will be associated with better psychological well-being and higher self-esteem.