datascienceprojects-blog
Data Science Assignments
5 posts
datascienceprojects-blog · 6 years ago
Logistic Regression Model
Introduction
A logistic regression model was fitted to establish whether there is a relationship between Anti-Social Behavioural Disorder and drug (cannabis) dependence. A second logistic model examined the same association after controlling for alcohol abuse and for a family history of alcoholism.
This submission is split into the following sections:
 Hypotheses
 Data
 Methodology
 Results
 Summary
 Output from Code (Appendix I)
 Code (Appendix II)
Hypothesis
Null Hypothesis: 
There is no significant association between Anti-Social Behavioural Disorder and Drug Dependence.
Alternate Hypothesis:
There is a significant association between Anti-Social Behavioural Disorder and Drug dependence.
Data
A subset of the NESARC data was used, drawn from the following sections:
ANTISOCIAL PERSONALITY DISORDER (BEHAVIOR)
CANNABIS ABUSE/DEPENDENCE
ALCOHOL ABUSE/DEPENDENCE
FAMILY HISTORY (I) OF ALCOHOLISM
FAMILY HISTORY (II) OF DRUG ABUSE
Response Variable - The response variable, ANTISOC, is binary and denotes the presence (Yes, coded 1) or absence (No, coded 0) of Anti-Social Personality Disorder.
Explanatory Variables - all also binary:
DRUGDEP - Yes (coded 1) indicates cannabis abuse or dependence, whether in the past 12 months or prior to the last 12 months; No (coded 0) indicates its absence.
FAMDRUGDEP - Yes (coded 1) indicates a family history of drug abuse; No (coded 0) indicates its absence.
ALCHABUS - Yes (coded 1) indicates alcohol abuse; No (coded 0) indicates its absence.
FAMALCSM - Yes (coded 1) indicates a family history of alcohol abuse; No (coded 0) indicates its absence.
Methodology
A logit regression was performed using the logit function from Python's statsmodels formula API, and the output was examined for association, significance and confounding; confidence intervals were established via odds ratios.
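As a minimal sketch of this step (mirroring the full script in Appendix II, and assuming the binary variables described above have already been derived on the analysis subset sub1):

import numpy
import statsmodels.formula.api as smf

# logit model of anti-social behaviour on drug dependence
lreg = smf.logit(formula='ANTISOC ~ DRUGDEP', data=sub1).fit()
print(lreg.summary())

# odds ratios and their 95% confidence intervals
conf = lreg.conf_int()
conf['OR'] = lreg.params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print(numpy.exp(conf))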
Results
The sample size was 1574 and the regression between Antisocial Behaviour and Drug Dependence was significant with a P-value of less than 0.0001.
Anti-social behaviour was positively associated with drug dependence: the odds ratio was 2.94, meaning that subjects in the sample with drug dependence were 2.94 times as likely to have anti-social behavioural problems as subjects without drug dependence.
The 95% confidence interval indicated that the true population odds ratio falls between 2.1 and 4.12.
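These figures are consistent with the usual construction of the interval: the odds ratio is exp(b) for the fitted coefficient b, and the 95% interval is exp(b ± 1.96 × SE(b)). As an approximate check using the reported values, b ≈ ln(2.94) ≈ 1.08 and SE ≈ 0.17, giving exp(1.08 - 1.96 × 0.17) ≈ 2.1 and exp(1.08 + 1.96 × 0.17) ≈ 4.1.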
Both drug dependence and alcohol abuse were positively associated with the likelihood of anti-social behaviour. The results suggested that subjects with drug dependence were about 2.9 times as likely to have anti-social behaviour disorder after controlling for alcohol abuse.
Similarly, subjects with alcohol abuse were approximately 1.4 times as likely to have anti-social behaviour disorder after controlling for drug dependence.
When a history of family alcoholism or family drug dependence was added to the model, the significant positive association between anti-social behaviour and drug dependence (after controlling for alcohol abuse) did not change. This suggests that neither a family history of alcohol abuse nor a family history of drug abuse is a confounder of the relationship between anti-social behaviour disorder and drug dependence.
Summary
A logit regression was performed on NESARC data to investigate the association between drug (cannabis) dependence and Anti-Social Behaviour Disorder, taking into account alcohol abuse and family histories of drug and alcohol abuse. The analysis of the sample showed that cannabis dependence and abuse was positively and significantly associated with the incidence of anti-social behaviour disorder. Alcohol abuse was also positively associated with anti-social behaviour disorder, but this association was not found to be significant. Family histories of drug and alcohol dependence were not established as confounders.
Appendix I
Output From Logit Regression
[Images: logit regression output tables]
Appendix II
Code
import numpy
import pandas
import statsmodels.api as sm
import seaborn
import statsmodels.formula.api as smf

# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%.2f' % x)

data = pandas.read_csv('C:\\...\\nesarc.csv', low_memory=False)

##############################################################################
# DATA MANAGEMENT
##############################################################################
# setting variables to numeric
basic_cols = ['IDNUM', 'TAB12MDX', 'S3AQ3C1', 'S3BQ3', 'AGE', 'SEX',
              'MAR12ABDEP', 'MARP12ABDEP', 'S3BD5Q2C', 'S3BD5Q2E', 'S3BQ1A5']
for col in basic_cols:
    data[col] = pandas.to_numeric(data[col], errors='coerce')
data['S3AQ3C1'] = data['S3AQ3C1'].replace(99, numpy.nan)
data['S3BQ3'] = data['S3BQ3'].replace(99, numpy.nan)

# Family Alcoholism
fam_alc_cols = (['S2DQ1', 'S2DQ2'] + ['S2DQ%dC2' % i for i in range(3, 11)] +
                ['S2DQ11', 'S2DQ12', 'S2DQ13A', 'S2DQ13B', 'S2DQ14A', 'S2DQ14B',
                 'S2DQ15', 'S2DQ16', 'S2DQ17', 'S2DQ18', 'S2DQ19', 'S2DQ21'])
for col in fam_alc_cols:
    data[col] = pandas.to_numeric(data[col], errors='coerce')

# Alcohol Abuse
alc_abuse_cols = (['S2BQ1A%d' % i for i in range(3, 9)] +
                  ['S2BQ1A9%s' % c for c in 'ABCDEFGHI'] +
                  ['S2BQ1B9J', 'S2BQ1C9K'] +
                  ['S2BQ1A%d' % i for i in range(10, 30)] +
                  ['S2BQ2A', 'S2BQ2B', 'S2BQ2C', 'S2BQ2D', 'S2BQ2E', 'S2BQ2FR',
                   'S2BQ3B', 'S2BQ3CR'])
for col in alc_abuse_cols:
    data[col] = pandas.to_numeric(data[col], errors='coerce')

# Anti-social behaviour
antisoc_cols = ['S11AQ1A%d' % i for i in range(1, 34)]
for col in antisoc_cols:
    data[col] = pandas.to_numeric(data[col], errors='coerce')

# Family Drug Abuse
fam_drug_cols = (['S3EQ1', 'S3EQ2'] +
                 ['S3EQ%d%s' % (i, s) for i in range(3, 11) for s in ('B', 'C')] +
                 ['S3EQ11', 'S3EQ12', 'S3EQ13', 'S3EQ14'])
for col in fam_drug_cols:
    data[col] = pandas.to_numeric(data[col], errors='coerce')
# cannabis use every day in the past 12 months between the ages of 15 to 40
#sub1=data[(data['AGE']>=18) & (data['AGE']<=24) & (data['S3BQ1A5']==1) & (data['S3BD5Q2C']==1)]

# Adults (irrespective of sex) up to the age of 60 who have ever used cannabis and who
# currently use cannabis anywhere from once a year to once a day.
sub1 = data[(data['AGE'] <= 60) & (data['S3BQ1A5'] == 1) &
            (data['S3BD5Q2C'] <= 10) & (data['S3BD5Q2E'] <= 10)]

#c1 = sub1["MAJORDEPLIFE"].value_counts(sort=False, dropna=False)
#print(c1)
c2 = sub1["AGE"].value_counts(sort=False, dropna=False)
#print(c2)
# ANTISOC = 1 if any of the 33 anti-social behaviour items (S11AQ1A1-S11AQ1A33) is a Yes
def ANTISOC(row):
    if any(row['S11AQ1A%d' % i] == 1 for i in range(1, 34)):
        return 1
    return 0

sub1['ANTISOC'] = sub1.apply(ANTISOC, axis=1)
c7 = sub1["ANTISOC"].value_counts(sort=False, dropna=False)
#print(c7)
# both drug dependence and abuse
def DRUGDEP(row):
    if row['MAR12ABDEP'] > 0 or row['MARP12ABDEP'] > 0:
        return 1
    return 0

sub1['DRUGDEP'] = sub1.apply(DRUGDEP, axis=1)
#print (pandas.crosstab(sub1['MAR12ABDEP'], sub1['DRUGDEP']))
# FAMDRUGDEP = 1 if any family drug abuse item is endorsed: the yes/no items are
# coded 1, and the relative-count items (the B columns) count when below 99
def FAMDRUGDEP(row):
    yes_items = (['S3EQ1', 'S3EQ2'] + ['S3EQ%dC' % i for i in range(3, 11)] +
                 ['S3EQ11', 'S3EQ12', 'S3EQ13', 'S3EQ14'])
    count_items = ['S3EQ%dB' % i for i in range(3, 11)]
    if any(row[c] == 1 for c in yes_items) or any(row[c] < 99 for c in count_items):
        return 1
    return 0

sub1['FAMDRUGDEP'] = sub1.apply(FAMDRUGDEP, axis=1)
c11 = sub1["FAMDRUGDEP"].value_counts(sort=False, dropna=False)
# ALCHABUS = 1 if any alcohol abuse/dependence item is endorsed
def ALCHABUS(row):
    yes_items = (['S2BQ1A%d' % i for i in range(3, 9)] +
                 ['S2BQ1A9%s' % c for c in 'ABCDEFGHI'] +
                 ['S2BQ1B9J', 'S2BQ1C9K'] +
                 ['S2BQ1A%d' % i for i in range(10, 30)] +
                 ['S2BQ2A', 'S2BQ2B', 'S2BQ2C'])
    if (any(row[c] == 1 for c in yes_items) or
            row['S2BQ2D'] < 99 or row['S2BQ2E'] < 99 or row['S2BQ3B'] < 99 or
            row['S2BQ2FR'] < 999 or row['S2BQ3CR'] < 999):
        return 1
    return 0

sub1['ALCHABUS'] = sub1.apply(ALCHABUS, axis=1)
c9 = sub1["ALCHABUS"].value_counts(sort=False, dropna=False)
# FAMALCSM = 1 if any family history of alcoholism item is endorsed
def FAMALCSM(row):
    fam_items = (['S2DQ1', 'S2DQ2'] + ['S2DQ%dC2' % i for i in range(3, 11)] +
                 ['S2DQ11', 'S2DQ12', 'S2DQ13A', 'S2DQ13B', 'S2DQ14A', 'S2DQ14B',
                  'S2DQ15', 'S2DQ16', 'S2DQ17', 'S2DQ18', 'S2DQ19', 'S2DQ21'])
    if any(row[c] == 1 for c in fam_items):
        return 1
    return 0

sub1['FAMALCSM'] = sub1.apply(FAMALCSM, axis=1)
c8 = sub1["FAMALCSM"].value_counts(sort=False, dropna=False)
#print(c8)
# logistic regression between Anti-Social Behaviour and Drug Dependence / Abuse
lreg9 = smf.logit(formula='ANTISOC ~ DRUGDEP', data=sub1).fit()
print (lreg9.summary())

# odds ratios
print ("Odds Ratios")
print (numpy.exp(lreg9.params))

# odds ratios with 95% confidence intervals
params = lreg9.params
conf = lreg9.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print (numpy.exp(conf))
###
# logistic regression between Anti-Social Behaviour and Drug Dependence with the
# absence or presence of a family history of drug dependence
lreg15 = smf.logit(formula='ANTISOC ~ DRUGDEP + FAMDRUGDEP', data=sub1).fit()
print (lreg15.summary())

# odds ratios
print ("Odds Ratios")
print (numpy.exp(lreg15.params))

# odds ratios with 95% confidence intervals
params = lreg15.params
conf = lreg15.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print (numpy.exp(conf))
###
# logistic regression between absence or presence of Anti-Social Behaviour and
# Drug Dependence and Alcohol Abuse
lreg12 = smf.logit(formula='ANTISOC ~ DRUGDEP + ALCHABUS', data=sub1).fit()
print (lreg12.summary())

# odds ratios
print ("Odds Ratios")
print (numpy.exp(lreg12.params))

# odds ratios with 95% confidence intervals
params = lreg12.params
conf = lreg12.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print (numpy.exp(conf))

# logistic regression between absence or presence of Anti-Social Behaviour and
# Drug Dependence and Alcohol Abuse, with a history of family alcoholism
lreg13 = smf.logit(formula='ANTISOC ~ DRUGDEP + ALCHABUS + FAMALCSM', data=sub1).fit()
print (lreg13.summary())

# odds ratios
print ("Odds Ratios")
print (numpy.exp(lreg13.params))

# odds ratios with 95% confidence intervals
params = lreg13.params
conf = lreg13.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print (numpy.exp(conf))
## logistic regression between absence or presence of Anti-Social Behaviour and
## Family Alcoholism and Drug Dependency
#lreg11 = smf.logit(formula = 'ANTISOC ~ FAMALCSM + DRUGDEP', data = sub1).fit()
#print (lreg11.summary())
## odds ratios
#print ("Odds Ratios")
#print (numpy.exp(lreg11.params))
#
## odds ratios with 95% confidence intervals
#params = lreg11.params
#conf = lreg11.conf_int()
#conf['OR'] = params
#conf.columns = ['Lower CI', 'Upper CI', 'OR']
#print (numpy.exp(conf))
datascienceprojects-blog · 7 years ago
Multiple Regression Test Of Employment Rate, Urban Rate, Alcohol Consumption and Suicide Rates  on Life Expectancy.
This submission has the following sections: 
- The Research Question
- The Dataset
- Methodology
- Results
- Charts
- Regression (output from Python)
- Python Code
Research Question: The study looked at the association of countries' declared employment rates, urban rates, alcohol consumption rates and suicide rates with their declared life expectancies, as reported in the GapMinder dataset.
Dataset: GapMinder.
Methodology
The data types were converted to numeric using pandas, and the explanatory variables were centred on their means. A linear regression was first performed using statsmodels' OLS function to establish the relationship between employment rate (explanatory variable) and life expectancy (response variable).
Other independent variables were then added to the model to check for confounding. The significance of each association was judged from the P-values (significant when P < 0.05) and the sign of the coefficients (positive or negative).
The strength of the association was measured by R-squared, which indicates how much of the variability in the response variable is explained by the explanatory variables.
The independent variables were added step-wise to the model, and the OLS regression results were printed at each step to show whether, and how, each variable modified the association.
A graph was produced to depict the association, with a quadratic line representing the second order polynomial fit.
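A condensed sketch of these steps (the full script is in Appendix II; reg2 here is the quadratic model):

import statsmodels.formula.api as smf

# centre the explanatory variable so the intercept refers to a country with the
# mean employment rate
sub1['employrate_c'] = sub1['employrate'] - sub1['employrate'].mean()

# linear fit, then quadratic (second order polynomial) fit
reg1 = smf.ols('lifeexpectancy ~ employrate_c', data=sub1).fit()
reg2 = smf.ols('lifeexpectancy ~ employrate_c + I(employrate_c**2)', data=sub1).fit()
print(reg2.summary())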
Results
There was a negative and significant relationship between the countries' Employment Rate and Life Expectancy. (Please see Chart 1 and the Regression Output below.)
The results showed that employment rate had a negative coefficient (-0.2987) and a significant P-value (P < 0.05) in the model for life expectancy. The R-squared (the amount of variability in the response variable that is explained by the explanatory variable) was just 9.6%.
Further analysis with a second order polynomial showed the relationship was curvilinear rather than simply linear: countries with both low and high employment rates tended to have low life expectancies. The R-squared improved to 13.9%.
When controlling for urban rate in the relationship between employment rate and life expectancy, it was observed that employment rate was no longer significantly associated with life expectancy. (P > 0.05). As the P-Value for employment rate is no longer significant when urban rate is included in the model, it can be said that urban rate confounds the relationship between employment rate and life expectancy.
When evaluating the independent associations of the other predictor variables, recorded alcohol consumption was positively and significantly associated with declared life expectancy, while suicide rate had a negative but non-significant association with life expectancy (P > 0.05).
The intercept, i.e. the predicted life expectancy when all the (centred) independent variables were at their means, was 69.5 years.
Together, the explanatory variables - employment rate, urban rate, alcohol consumption rate and suicide rate - accounted for only about 47.5% of the variability observed in the declared life expectancies.
The QQ plot plots the residuals, i.e. the differences between the predicted and the actual life expectancies for each country. The residuals are expected to follow a straight line if they are normally distributed (a key assumption). The plot shows slight deviations at the upper and lower ends, suggesting that, while an improvement on the linear model, the curvilinear model still does not fully estimate life expectancy from the given explanatory variables. (Please see Chart 2 below.)
The Standardised Residuals
Only a very few countries (2) had standardised residuals above 2, and just about five had residuals below -2, so 95.8% of the residuals fell within two standard deviations of the mean (slightly higher than the expected 95%). The fact that 4.2% of the residuals fell outside this range suggests the model could be improved, perhaps by adding more explanatory variables. (Please see Chart 3 below.)
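The percentage quoted above can be recomputed directly from the model's standardised (Pearson) residuals; a small sketch based on the appendix script (reg5 is the full model):

import pandas

stdres = pandas.DataFrame(reg5.resid_pearson, columns=['stdres'])
share = ((stdres['stdres'] > -2) & (stdres['stdres'] < 2)).mean() * 100
print('%.1f%% of residuals fall within two standard deviations' % share)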
The Leverage Plot
The leverage plot shows a few outliers, i.e. countries with residuals greater than 2 or less than -2; however, all of these (except one, South Africa) have low leverage values. (Please see Chart 4 below.)
Charts
Chart 1: Multiple Regression
Chart 2: The QQ Plot plots the residuals
Chart 3: The Standardised Residuals
 Chart 4: The Leverage Plot
Appendix I - Regression Results - Output From Python
[Image: OLS regression results output]
Appendix II - Python Code
# -*- coding: utf-8 -*-
"""
Created on Sunday 16th September 2018

@author: Lloney Monono
"""
import pandas
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
import seaborn

# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%.2f' % x)

data = pandas.read_csv('gapminder.csv')

# conversions to numeric datatype
for col in ['urbanrate', 'suicideper100th', 'alcconsumption', 'employrate', 'lifeexpectancy']:
    data[col] = pandas.to_numeric(data[col], errors='coerce')

# removal of NAs
sub1 = data[['alcconsumption', 'lifeexpectancy', 'suicideper100th', 'employrate', 'urbanrate']].dropna()
####################################################################################
# POLYNOMIAL REGRESSION
####################################################################################

# Establish a linear relationship between Employment Rate & life expectancy
scat1 = seaborn.regplot(x="employrate", y="lifeexpectancy", scatter=True, data=sub1)
plt.xlabel('Employment Rate')
plt.ylabel('Life Expectancy')

## use the second order polynomial *order = 2*
# run the 2 scatterplots together to get both linear and second order fit lines
scat2 = seaborn.regplot(x="employrate", y="lifeexpectancy", scatter=True, order=2, data=sub1)
plt.xlabel('Employment Rate')
plt.ylabel('Life Expectancy')

## centering the independent (explanatory) variables: urban rate, alcohol consumption,
## suicide rate & employment rate
sub1['urbanrate_c'] = sub1['urbanrate'] - sub1['urbanrate'].mean()
sub1['alcconsumption_c'] = sub1['alcconsumption'] - sub1['alcconsumption'].mean()
sub1['suicideper100th_c'] = sub1['suicideper100th'] - sub1['suicideper100th'].mean()
sub1['employrate_c'] = sub1['employrate'] - sub1['employrate'].mean()
sub1[["alcconsumption_c", "suicideper100th_c"]].describe()
# Perform a linear regression analysis between life expectancy & employment rate
reg1 = smf.ols('lifeexpectancy ~ employrate_c', data=sub1).fit()
print (reg1.summary())

# quadratic (polynomial) regression analysis between life expectancy & employment rate,
# using the square of the explanatory variable employment rate
reg2 = smf.ols('lifeexpectancy ~ employrate_c + I(employrate_c**2)', data=sub1).fit()
print (reg2.summary())

# quadratic regression analysis between life expectancy & employment rate,
# controlling for urban rate
reg3 = smf.ols('lifeexpectancy ~ employrate_c + I(employrate_c**2) + urbanrate_c', data=sub1).fit()
print (reg3.summary())

# quadratic regression analysis between life expectancy & employment rate,
# controlling for urban rate & alcohol consumption
reg4 = smf.ols('lifeexpectancy ~ employrate_c + I(employrate_c**2) + urbanrate_c + alcconsumption_c', data=sub1).fit()
print (reg4.summary())

# quadratic regression analysis between life expectancy & employment rate,
# controlling for urban rate, alcohol consumption & suicide rate per 100,000
reg5 = smf.ols('lifeexpectancy ~ employrate_c + I(employrate_c**2) + urbanrate_c + alcconsumption_c + suicideper100th_c', data=sub1).fit()
print (reg5.summary())
####################################################################################
# EVALUATING MODEL FIT
####################################################################################

# Q-Q plot for normality
fig2 = sm.qqplot(reg5.resid, line='r')
plt.ylabel('Standardized Residual')
plt.xlabel('Observation Number')

## simple plot of residuals
stdres = pandas.DataFrame(reg5.resid_pearson)
fig3 = sm.qqplot(stdres)
l = plt.axhline(y=0, color='r')
plt.ylabel('Std. Residual')
plt.xlabel('Observations')

# Leverage plots
fig4 = sm.graphics.influence_plot(reg5, size=8)
print(fig4)
datascienceprojects-blog · 7 years ago
Data Science Assignment: Test A Basic Linear Model
The association between Employment Rate (the explanatory variable) and Life Expectancy (the Response variable) was tested using data from the Gap Minder Dataset.
Data:
The Gap Minder Dataset was used and both the response variable and the explanatory variables are quantitative variables.
The response variable is: lifeexpectancy
The explanatory variable is: employrate
The number of observations used in the study is : 176
Methodology
A Python script was used, calling the pandas, numpy and statsmodels libraries, to fit the regression model.
Results
The results of the regression model indicated that life expectancy was significantly and negatively associated with employment rate (F-statistic 20.66, P-value less than 0.0001).
The P-value is well below 0.05, so we can reject the null hypothesis and conclude that employment rate is significantly (and negatively) associated with life expectancy.
R-squared, the proportion of variance in the response variable that can be explained by the explanatory variable, is small: the model accounts for just 11% of the variance seen in the response variable, life expectancy.
The regression model’s best fit line for the response variable Life Expectancy is: 
Life Expectancy = 87.66 + (-0.3085) × Employment Rate
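As an illustration using the reported coefficients (the employment rate of 60% is just an assumed example value), a country with an employment rate of 60 would be predicted a life expectancy of roughly 87.66 - 0.3085 × 60 ≈ 69.2 years.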
Appendix I: Output from the Python Script: 
[Images: OLS regression output]
Appendix II: The Python Script
# -*- coding: utf-8 -*-
"""
Created on Sunday 30th Aug. 2018

@author: Lloney Monono
"""
import pandas
import numpy
import statsmodels.formula.api as smf
import seaborn
import matplotlib.pyplot as plt

data = pandas.read_csv('C:\\...\\gapminder.csv', low_memory=False)

data['employrate'] = pandas.to_numeric(data['employrate'], errors='coerce')
data['lifeexpectancy'] = pandas.to_numeric(data['lifeexpectancy'], errors='coerce')
print('OLS Regression Model between Employment Rate (Explanatory Variable) and Life Expectancy (Response Variable)')

olsreg = smf.ols('lifeexpectancy ~ employrate', data=data).fit()
print(olsreg.summary())
data['employrate'] = data['employrate'].replace(' ', numpy.nan)
scat1 = seaborn.regplot(x='employrate', y='lifeexpectancy', fit_reg=True, data=data)
plt.xlabel('Employment Rate')
plt.ylabel('Life Expectancy')
plt.title('Scatterplot for Employment Rate & Life Expectancy')
data_clean = data.dropna()
datascienceprojects-blog · 7 years ago
Data Science Assignment
Analysing The Arrest Log For Providence Police Department.
Step 1: Describe your sample.
a) Describe the study population (who or what was studied).
The study population is adults arrested and charged by the Providence Police Department during the past 60 days.
b) Report the level of analysis studied (individual, group, or aggregate).
The analysis seeks to establish if there was any link between an arrested individual's race or ethnicity and recorded Assault offences.
H0: An individual's race or ethnicity is independent of recorded assault offences.
H1: There is a relationship between an individual's race or ethnicity and recorded assault offences.
c) Report the number of observations in the data set.
The report comprises 1160 arrest records.
d) Describe your data analytic sample (the sample you are using for your analyses).
The data sample comprises 140 Assault offences attributed to individuals of White or Black race and Hispanic or Non-Hispanic ethnicity.
Step 2 Describe the procedures that were used to collect the data.
a) Report the study design that generated that data (for example: data reporting, surveys, observation, experiment).
Data was generated from recorded arrests by the Providence Police Department.  
b) Describe the original purpose of the data collection.
The original purpose of the data was to provide a log of all individuals arrested for Rhode Island Statute Violations.
c) Describe how the data were collected.
The data was captured from the arrested individuals in Providence.
d) Report when the data were collected.
The data in the report was captured during the 60 days prior to 18th May 2018, encompassing the months of March, April and May 2018.
e) Report where the data were collected.
The data was collected by the Providence Police Department in Providence, Rhode Island, and put online for public access.
Step 3: Describe your variables.
a) Describe what your explanatory and response variables measured.
The variables are categorical: Race/Ethnicity and Statute Description.
The explanatory variables are Race and Ethnicity, while the response variable is the Statute Description (Assault).
b) Describe the response scales for your explanatory and response variables.
The explanatory variables are factors with two levels each, for example race (White / Black) and ethnicity (Hispanic / non-Hispanic). The response variable is a count of assault crimes (any statute-violation description that contains the word "Assault").
c) Describe how you managed your explanatory and response variables.
To perform a Chi-Square Test of Independence, a two-way (contingency) table is constructed from the data, as sketched below.
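As a minimal sketch of that step, assuming the arrest log has been loaded into a DataFrame and using hypothetical column and file names ('race', 'statute_description' and 'providence_arrest_log.csv'; the actual names in the published log may differ):

import pandas
import scipy.stats

arrests = pandas.read_csv('providence_arrest_log.csv')  # hypothetical file name

# flag any statute-violation description containing the word "Assault"
arrests['is_assault'] = arrests['statute_description'].str.contains('Assault', case=False, na=False)

# two-way (contingency) table of race against assault / non-assault charges
table = pandas.crosstab(arrests['race'], arrests['is_assault'])
chi2, p, dof, expected = scipy.stats.chi2_contingency(table)
print(table)
print('chi-square = %.2f, p = %.4f' % (chi2, p))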
datascienceprojects-blog · 7 years ago
Blog For Assignments From Data Science Toolkit Course.
Assignment-1 ANOVA
The NESARC data was examined to determine the association between the mean number of cannabis joints consumed (quantitative response variable) and respondents' declared ethnicities.
This submission is split into the following sections:
 Hypotheses
 Data
 Methodology
 Results
 Summary
 Code (Appendix I)
 Output from Code (Appendix II)
Hypothesis :
Null Hypothesis: There are no significant differences between the means of monthly cannabis joints smoked by respondents aged 18 - 60 and their declared ethnicities.
Alternate Hypothesis: There are significant differences between the means of monthly cannabis joints smoked by respondents aged 18 - 60 and their declared ethnicities.
Data
A subset of the NESARC data was used, covering respondents of all declared ethnicities between the ages of 18 and 60 who had smoked cannabis in the last 12 months.
Explanatory Variable - Declared Ethnicity (5 levels)
Response Variable - The estimated number of cannabis joints smoked per month over the last 12 months by respondents aged 18 - 60.
Methodology
The declared number of days per month on which a joint was smoked was multiplied by the declared number of joints smoked per day in the last 12 months to estimate the number of joints consumed in a month.
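For example (using the recode values from the script in Appendix I), a respondent whose usage frequency recodes to 24 days per month and who reported 2 joints per day would be assigned an estimated 24 × 2 = 48 joints per month.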
ANOVA was then performed between the explanatory variable (declared ethnicity) and the response variable (the estimated number of joints smoked per month by the subset of respondents aged 18 - 60).
The F-statistic and associated P-value were calculated using the Ordinary Least Squares (OLS) function from the statsmodels library.
Post Hoc tests were run using Tukey’s Multiple Comparison Of Means.
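In condensed form (the full script follows in Appendix I), the two modelling steps look like this:

import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

# ANOVA via OLS with declared ethnicity treated as a categorical factor
model = smf.ols('EST_JOINTCOUNT ~ C(ETHRACE2A)', data=sub3).fit()
print(model.summary())

# Tukey HSD post hoc comparison of the group means
mc = multi.MultiComparison(sub3['EST_JOINTCOUNT'], sub3['ETHRACE2A'])
print(mc.tukeyhsd().summary())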
Results
American Indians and Alaska Natives (Group 3) were, on average, the heaviest users of cannabis, consuming 75.4 joints per month. Those who described themselves as White (Group 1) were, on average, the lightest consumers of cannabis joints.
Differences between the means were observed across all groups. The mean differences ranged from -2.47 between Groups 4 and 5 (i.e. between Asian / Pacific Hawaiian and Hispanic or Latino) to 37.6 between Groups 1 and 3 (i.e. between White and American Indian).
The F-Statistic was 3.451 and the P-Value was 0.00824. 
Summary
The P-value is below 0.05 (P < 0.05), so we reject the null hypothesis that there are no significant differences between the mean monthly cannabis joints consumed across the respondents' declared ethnicities.
To determine how the means differed from each other, a Tukey Post Hoc Multiple Comparison of Means test was performed. The post hoc tests showed that only the comparison between ethnic Group 1 (White) and Group 2 (Black) was significant (the null hypothesis could be rejected for that pair); none of the other pair-wise comparisons of mean monthly cannabis consumption across the declared ethnicities showed significant differences.
Appendix I
Code:
# -*- coding: utf-8 -*-
"""
Created on Sun. 1st April 2018

@author: Lloney Monono
"""
import numpy
import pandas
import statsmodels.formula.api as statsmodform
import statsmodels.stats.multicomp as statsmodmulti

data = pandas.read_csv('C:\\Users\\Llone\\Documents\\Data Scientists Toolbox\\Python\\nesarc.csv', low_memory=False)
# How often cannabis was used in the last 12 months, cast to numeric
data['S3BD5Q2C'] = pandas.to_numeric(data['S3BD5Q2C'], errors='coerce')
# Number of cannabis joints usually consumed in a day in the last 12 months
data['S3BQ3'] = pandas.to_numeric(data['S3BQ3'], errors='coerce')

# A subset of adults aged 18 to 60 who have had at least one joint every month
# in the past 12 months
sub1 = data[(data['AGE'] >= 18) & (data['AGE'] <= 60) & (data['S3BD5Q2C'] <= 6)]

# Missing data not to be included in the analysis
sub1['S3BD5Q2C'] = sub1['S3BD5Q2C'].replace(99, numpy.nan)
sub1['S3BQ3'] = sub1['S3BQ3'].replace(99, numpy.nan)

# Recoding the number of days a joint was had in the past month, ignoring anything
# below 1x a month (see the condition <= category 6 above)
recode1 = {1: 30, 2: 24, 3: 16, 4: 8, 5: 3, 6: 1}
sub1['JOINTFREQMO'] = sub1['S3BD5Q2C'].map(recode1)

# Casting variable JOINTFREQMO to numeric
sub1['JOINTFREQMO'] = pandas.to_numeric(sub1['JOINTFREQMO'], errors='coerce')

# Estimating the number of joints consumed in a month by multiplying the monthly
# frequency by the number of joints per day
sub1['EST_JOINTCOUNT'] = sub1['JOINTFREQMO'] * sub1['S3BQ3']

sub1['EST_JOINTCOUNT'] = pandas.to_numeric(sub1['EST_JOINTCOUNT'], errors='coerce')

ct1 = sub1.groupby('EST_JOINTCOUNT').size()
print (ct1)
# Looking at a more-than-two-levels ANOVA using Ethnicity (sub3)
sub3 = sub1[['EST_JOINTCOUNT', 'ETHRACE2A']].dropna()

model2 = statsmodform.ols(formula='EST_JOINTCOUNT ~ C(ETHRACE2A)', data=sub3).fit()
print (model2.summary())

print ('means for EST_JOINTCOUNT by Declared Ethnicity')
m2 = sub3.groupby('ETHRACE2A').mean()
print (m2)

print ('Standard deviations for EST_JOINTCOUNT by Declared Ethnicity')
sd2 = sub3.groupby('ETHRACE2A').std()
print (sd2)

# Post hoc test using Tukey HSD
mc1 = statsmodmulti.MultiComparison(sub3['EST_JOINTCOUNT'], sub3['ETHRACE2A'])
res1 = mc1.tukeyhsd()
print(res1.summary())
Appendix II
Output:
dtype: int64
                            OLS Regression Results
==============================================================================
Dep. Variable:         EST_JOINTCOUNT   R-squared:                       0.014
Model:                            OLS   Adj. R-squared:                  0.010
Method:                 Least Squares   F-statistic:                     3.451
Date:                Sun, 01 Apr 2018   Prob (F-statistic):            0.00824
Time:                        15:53:51   Log-Likelihood:                -5913.5
No. Observations:                 978   AIC:                         1.184e+04
Df Residuals:                     973   BIC:                         1.186e+04
Df Model:                           4
Covariance Type:            nonrobust
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept            37.8015      4.376      8.638      0.000      29.214      46.389
C(ETHRACE2A)[T.2]    27.7614      8.484      3.272      0.001      11.112      44.411
C(ETHRACE2A)[T.3]    37.6039     17.415      2.159      0.031       3.428      71.780
C(ETHRACE2A)[T.4]    12.3860     26.004      0.476      0.634     -38.645      63.417
C(ETHRACE2A)[T.5]     9.9161      8.863      1.119      0.263      -7.476      27.308
==============================================================================
Omnibus:                     1458.578   Durbin-Watson:                   2.077
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           608052.290
Skew:                           8.478   Prob(JB):                         0.00
Kurtosis:                     123.971   Cond. No.                         8.28
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

means for EST_JOINTCOUNT by Declared Ethnicity
           EST_JOINTCOUNT
ETHRACE2A
1               37.801457
2               65.562814
3               75.405405
4               50.187500
5               47.717514

Standard deviations for EST_JOINTCOUNT by Declared Ethnicity
           EST_JOINTCOUNT
ETHRACE2A
1               75.687860
2              156.727470
3              139.172846
4               96.341904
5               88.870305

Multiple Comparison of Means - Tukey HSD, FWER=0.05
=================================================
group1 group2 meandiff    lower     upper   reject
-------------------------------------------------
  1      2     27.7614    4.5751   50.9476   True
  1      3     37.6039   -9.9903   85.1982  False
  1      4     12.3860  -58.6814   83.4535  False
  1      5      9.9161  -14.3047   34.1368  False
  2      3      9.8426  -40.3248   60.0100  False
  2      4    -15.3753  -88.1911   57.4404  False
  2      5    -17.8453  -46.7970   11.1064  False
  3      4    -25.2179 -109.0615   58.6257  False
  3      5    -27.6879  -78.3417   22.9659  False
  4      5     -2.4700  -75.6217   70.6818  False
-------------------------------------------------