This blog is dedicated to Coursera's Data Analysis and Interpretation specialization by Wesleyan University.
Capstone Project Assignment 2 : Methods
Sample :
The sample includes details of N = 891 passengers of the Titanic.
Measures :
There are eleven variables (columns) in total in the Titanic dataset:
Survived : 0 = No, 1 = Yes
pclass : Passenger class, a proxy for socio-economic status (SES); 1st ~ Upper, 2nd ~ Middle, 3rd ~ Lower
name : Name of passenger
sex : Gender of passenger
age : Age of passenger in years; fractional if age is less than one
sibsp : Number of siblings/spouses aboard
parch : Number of parents/children aboard
ticket : Ticket number
fare : Passenger fare
cabin : Cabin of ship
embarked : Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
Out of these eleven variables, 'Survived' is the target variable. As explanatory variables, we will use only pclass, sex, age, embarked, sibsp and parch.
Passenger class is a proxy for socio-economic status, so upper-class passengers, being rich or powerful, had easier access to lifeboats and hence a better chance of survival than the other classes.
Females were more likely to get a place in a lifeboat than male passengers and hence had a higher probability of survival. Similarly, children were more likely to be boarded onto lifeboats, while old people, who had already lived most of their lives, were less likely to board them.
The importance of the embarked, parch and sibsp variables will be visualized in the next section.
Including the variable 'name' among the explanatory variables would not make sense, because we cannot infer a person's survival from their name, except perhaps for the names of particular celebrities, politicians or wealthy businessmen. Similarly, including 'ticket' would not make sense, as ticket numbers are mostly unique, and there is also a lot of missing data that is very difficult to fill in.
Cabin could be included, but it has a lot of missing data, and filling in those missing values is very hard.
The variable 'fare' looks like it should be included, but if you analyse this variable more deeply you will realise that including it would lead to a worse model, as 'fare' is confounded with 'age', 'embarked', 'pclass' and perhaps 'sex'. A passenger's fare depends on age (children had lower fares), port of embarkation (the longer the voyage, the more you pay) and passenger class (the higher the class, the higher the price), and females may even have been given discounts. So 'fare' mostly repeats information already carried by the variables we have included.
There is some missing data in the variable 'age', which is set to the median age. Similarly, missing values in 'embarked' are set to 'S', as 'S' has the highest frequency. The quantitative variable age is categorized into five groups - CHILDREN, ADOLESCENTS, ADULTS, MIDDLE AGE and OLD - for better visualization.
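A minimal pandas sketch of these cleaning steps (column names follow the Kaggle CSV; the exact bin edges are my own assumption, since the post does not state them):

import pandas as pd

# Load the Kaggle Titanic training data (file name as provided by Kaggle)
df = pd.read_csv('train.csv')

# Impute missing ages with the median age
df['Age'] = df['Age'].fillna(df['Age'].median())

# Impute missing ports of embarkation with the most frequent port, 'S'
df['Embarked'] = df['Embarked'].fillna('S')

# Bin age into five categories; the bin edges here are assumed for illustration
df['AgeGroup'] = pd.cut(df['Age'],
                        bins=[0, 12, 19, 40, 60, 120],
                        labels=['CHILDREN', 'ADOLESCENTS', 'ADULTS',
                                'MIDDLE AGE', 'OLD'])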
Analysis :
The distributions for the predictors were evaluated by examining frequency tables for categorical variables and calculating the mean, standard deviation and minimum and maximum values for quantitative variables. Scatter plots were also examined, and Pearson Correlation was used to test bivariate associations between individual predictors and the response variable.
A random forest algorithm was used to predict the survival of passengers. The random forest model was estimated on a training data set consisting of a random 70% sample; a test data set included the other 30% of the sample. A confusion matrix was used to evaluate the model.
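A hedged sketch of this analysis plan (the encoding choices, n_estimators and random_state are my assumptions, and the import path follows current scikit-learn; the author's actual code is not shown in this post):

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

# One-hot encode the chosen categorical predictors (df from the sketch above)
X = pd.get_dummies(df[['Pclass', 'Sex', 'AgeGroup', 'Embarked', 'SibSp', 'Parch']],
                   columns=['Sex', 'AgeGroup', 'Embarked'])
y = df['Survived']

# 70% training / 30% test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
pred = rf.predict(X_test)

print(confusion_matrix(y_test, pred))
print(accuracy_score(y_test, pred))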
Capstone Project Assignment 1
Research Question:
Analysis and prediction of survival on the Titanic.
The purpose of this study is to first analyze and then predict survival on the Titanic. First, an analysis will be done to find out how different factors were involved in the survival of passengers on the Titanic; then, after analyzing those factors, predictions will be made.
Data for this research is taken from a Kaggle competition and can be found here.
Machine Learning For Data Analysis Week 4 : K Means Clustering
Code:
from pandas import Series, DataFrame
import pandas
import numpy
import matplotlib.pylab as plt
from sklearn.cross_validation import train_test_split
from sklearn.cluster import KMeans

data = pandas.read_csv('nesarc_pds.csv', low_memory=False)

# Setting variables to numeric
data['CONSUMER'] = pandas.to_numeric(data['CONSUMER'], errors='coerce')
data['S2AQ16A'] = pandas.to_numeric(data['S2AQ16A'], errors='coerce')
data['S2DQ3C1'] = pandas.to_numeric(data['S2DQ3C1'], errors='coerce')
data['S2DQ3C2'] = pandas.to_numeric(data['S2DQ3C2'], errors='coerce')
data['S2DQ4C1'] = pandas.to_numeric(data['S2DQ4C1'], errors='coerce')
data['S2DQ4C2'] = pandas.to_numeric(data['S2DQ4C2'], errors='coerce')
data['S2DQ1'] = pandas.to_numeric(data['S2DQ1'], errors='coerce')
data['S2DQ2'] = pandas.to_numeric(data['S2DQ2'], errors='coerce')
data['SEX'] = pandas.to_numeric(data['SEX'], errors='coerce')

# Subset data to respondents who started drinking between ages 10 and 30
sub1 = data[(data['S2AQ16A'] >= 10) & (data['S2AQ16A'] <= 30)]
# Copy the new DataFrame
sub2 = sub1.copy()

# Recoding missing data
sub2['S2AQ16A'] = sub2['S2AQ16A'].replace(99, numpy.nan)
sub2['S2DQ3C1'] = sub2['S2DQ3C1'].replace(99, numpy.nan)
sub2['S2DQ3C2'] = sub2['S2DQ3C2'].replace(9, numpy.nan)
sub2['S2DQ4C1'] = sub2['S2DQ4C1'].replace(99, numpy.nan)
sub2['S2DQ4C2'] = sub2['S2DQ4C2'].replace(9, numpy.nan)
sub2['S2DQ1'] = sub2['S2DQ1'].replace(9, numpy.nan)
sub2['S2DQ2'] = sub2['S2DQ2'].replace(9, numpy.nan)

# Creating a secondary variable for the number of siblings who drink
sub2['SIBNO'] = sub2['S2DQ3C1'] + sub2['S2DQ4C1']

# Defining a new variable for sibling drinking status by combining brothers' and sisters' data
def SIBSTS(row):
    if any([row['S2DQ3C2'] == 1, row['S2DQ4C2'] == 1]):
        return 1
    elif all([row['S2DQ3C2'] == 2, row['S2DQ4C2'] == 2]):
        return 0
    else:
        return numpy.nan
sub2['SIBSTS'] = sub2.apply(lambda row: SIBSTS(row), axis=1)

# Defining a new variable for parents' drinking status
def PRSTS(row):
    if any([row['S2DQ1'] == 1, row['S2DQ2'] == 1]):
        return 1
    elif all([row['S2DQ1'] == 2, row['S2DQ2'] == 2]):
        return 0
    else:
        return numpy.nan
sub2['PRSTS'] = sub2.apply(lambda row: PRSTS(row), axis=1)

# Recoding values for 'CONSUMER' into a new variable, DRSTS
recode1 = {1: 1, 2: 0, 3: 0}
sub2['DRSTS'] = sub2['CONSUMER'].map(recode1)

# Recoding new values for the SEX variable
recode2 = {1: 1, 2: 0}
sub2['GEN'] = sub2['SEX'].map(recode2)

data_clean = sub2.dropna()

""" Modeling and Prediction """
# Standardize the clustering variables and split into training and testing sets
pred = data_clean[['SIBNO', 'SIBSTS', 'PRSTS', 'GEN', 'S2AQ16A']]
clustervar = pred.copy()

from sklearn import preprocessing
clustervar['SIBNO'] = preprocessing.scale(clustervar['SIBNO'].astype('float64'))
clustervar['SIBSTS'] = preprocessing.scale(clustervar['SIBSTS'].astype('float64'))
clustervar['PRSTS'] = preprocessing.scale(clustervar['PRSTS'].astype('float64'))
clustervar['GEN'] = preprocessing.scale(clustervar['GEN'].astype('float64'))
clustervar['S2AQ16A'] = preprocessing.scale(clustervar['S2AQ16A'].astype('float64'))

clus_train, clus_test = train_test_split(clustervar, test_size=.3, random_state=123)

# k-means cluster analysis for 1-9 clusters
from scipy.spatial.distance import cdist
clusters = range(1, 10)
meandist = []

for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(clus_train)
    clusassign = model.predict(clus_train)
    meandist.append(sum(numpy.min(cdist(clus_train, model.cluster_centers_, 'euclidean'), axis=1)) / clus_train.shape[0])

"""
Plot the average distance from the observations to the cluster centroid
to use the elbow method to identify the number of clusters to choose
"""
plt.plot(clusters, meandist)
plt.xlabel('Number of clusters')
plt.ylabel('Average distance')
plt.title('Selecting k with the Elbow Method')

# Interpret the 4-cluster solution
model3 = KMeans(n_clusters=4)
model3.fit(clus_train)
clusassign = model3.predict(clus_train)

# Plot clusters on the first two principal components
from sklearn.decomposition import PCA
pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(clus_train)
plt.scatter(x=plot_columns[:, 0], y=plot_columns[:, 1], c=model3.labels_)
plt.xlabel('Canonical variable 1')
plt.ylabel('Canonical variable 2')
plt.title('Scatterplot of Canonical Variables for 4 Clusters')
plt.show()

"""
BEGIN multiple steps to merge the cluster assignment with the clustering
variables to examine cluster variable means by cluster
"""
# Create a unique identifier variable from the index of the cluster
# training data to merge with the cluster assignment variable
clus_train.reset_index(level=0, inplace=True)
# Create a list with the new index variable
cluslist = list(clus_train['index'])
# Create a list of cluster assignments
labels = list(model3.labels_)
# Combine the index list with the cluster assignment list into a dictionary
newlist = dict(zip(cluslist, labels))
# Convert the newlist dictionary to a dataframe
newclus = DataFrame.from_dict(newlist, orient='index')
# Rename the cluster assignment column
newclus.columns = ['cluster']

# Now do the same for the cluster assignment variable: create a unique
# identifier from the index of the cluster assignment dataframe
# to merge with the cluster training data
newclus.reset_index(level=0, inplace=True)
# Merge the cluster assignment dataframe with the cluster training
# variable dataframe by the index variable
merged_train = pandas.merge(clus_train, newclus, on='index')
merged_train.head(n=100)
# Cluster frequencies
merged_train.cluster.value_counts()

"""
END multiple steps to merge the cluster assignment with the clustering
variables to examine cluster variable means by cluster
"""

# FINALLY calculate clustering variable means by cluster
clustergrp = merged_train.groupby('cluster').mean()
print("Clustering variable means by cluster")
print(clustergrp)

# Validate clusters in the training data by examining cluster differences
# in drinking status (DRSTS) using ANOVA; first merge DRSTS with the
# clustering variables and cluster assignment data
gpa_data = data_clean['DRSTS']
# Split the DRSTS data into train and test sets
gpa_train, gpa_test = train_test_split(gpa_data, test_size=.3, random_state=123)
gpa_train1 = pandas.DataFrame(gpa_train)
gpa_train1.reset_index(level=0, inplace=True)
merged_train_all = pandas.merge(gpa_train1, merged_train, on='index')
sub1 = merged_train_all[['DRSTS', 'cluster']].dropna()

import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

gpamod = smf.ols(formula='DRSTS ~ C(cluster)', data=sub1).fit()
print(gpamod.summary())

print('means for DRSTS by cluster')
m1 = sub1.groupby('cluster').mean()
print(m1)

print('standard deviations for DRSTS by cluster')
m2 = sub1.groupby('cluster').std()
print(m2)

mc1 = multi.MultiComparison(sub1['DRSTS'], sub1['cluster'])
res1 = mc1.tukeyhsd()
print(res1.summary())
Output :
A k-means cluster analysis was performed to identify underlying subgroups of respondents based on the similarity of their responses on 5 variables that could have an impact on individuals' drinking status. The clustering variables included three binary variables - PRSTS (parents' drinking status), SIBSTS (siblings' drinking status) and GEN (gender) - and two quantitative variables, SIBNO (number of siblings who drink) and age at onset of drinking (S2AQ16A).
The elbow plot suggested possible solutions at 3, 4, 6 and 8 clusters; I chose the 4-cluster solution.
As can be seen from the figure, the clustering is not good: the clusters are not well separated.
Clustering variable means by cluster
                index     SIBNO    SIBSTS     PRSTS       GEN   S2AQ16A
cluster
0        21945.028524 -0.388644 -0.494159 -0.571713 -0.957500  0.255108
1        21747.304583 -0.388644 -0.494159 -0.571713  1.044387 -0.109161
2        21379.038997 -0.388644 -0.494159  1.749130 -0.106187 -0.162766
3        20697.274571  1.578009  2.023639  0.516130 -0.095059 -0.135573
                            OLS Regression Results
==============================================================================
Dep. Variable:                  DRSTS   R-squared:                       0.008
Model:                            OLS   Adj. R-squared:                  0.008
Method:                 Least Squares   F-statistic:                     57.62
Date:                Wed, 01 Feb 2017   Prob (F-statistic):           4.36e-37
Time:                        19:32:35   Log-Likelihood:                -10854.
No. Observations:               20885   AIC:                         2.172e+04
Df Residuals:                   20881   BIC:                         2.175e+04
Df Model:                           3
Covariance Type:            nonrobust
===================================================================================
                      coef    std err          t      P>|t|      [95.0% Conf. Int.]
-----------------------------------------------------------------------------------
Intercept           0.7672      0.005    153.863      0.000         0.757     0.777
C(cluster)[T.1]     0.0644      0.007      9.202      0.000         0.051     0.078
C(cluster)[T.2]     0.0403      0.009      4.624      0.000         0.023     0.057
C(cluster)[T.3]    -0.0318      0.008     -3.952      0.000        -0.048    -0.016
==============================================================================
Omnibus:                     3951.048   Durbin-Watson:                   2.013
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             6758.200
Skew:                          -1.393   Prob(JB):                         0.00
Kurtosis:                       2.991   Cond. No.                         4.44
==============================================================================
means for DRSTS by cluster
            DRSTS
cluster
0        0.767152
1        0.831582
2        0.807490
3        0.735330

standard deviations for DRSTS by cluster
            DRSTS
cluster
0        0.422678
1        0.374264
2        0.394332
3        0.441211
Multiple Comparison of Means - Tukey HSD, FWER=0.05
=============================================
group1 group2 meandiff  lower    upper  reject
---------------------------------------------
  0      1     0.0644   0.0464   0.0824  True
  0      2     0.0403   0.0179   0.0628  True
  0      3    -0.0318  -0.0525  -0.0111  True
  1      2    -0.0241  -0.0464  -0.0018  True
  1      3    -0.0963  -0.1168  -0.0757  True
  2      3    -0.0722  -0.0967  -0.0476  True
---------------------------------------------
Machine Learning For Data Analysis Week 3 : Lasso Regression
Code :
import pandas
import numpy
import matplotlib.pylab as plt
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LassoLarsCV

data = pandas.read_csv('nesarc_pds.csv', low_memory=False)

# Setting variables to numeric
data['CONSUMER'] = pandas.to_numeric(data['CONSUMER'], errors='coerce')
data['S2AQ16A'] = pandas.to_numeric(data['S2AQ16A'], errors='coerce')
data['S2DQ3C1'] = pandas.to_numeric(data['S2DQ3C1'], errors='coerce')
data['S2DQ3C2'] = pandas.to_numeric(data['S2DQ3C2'], errors='coerce')
data['S2DQ4C1'] = pandas.to_numeric(data['S2DQ4C1'], errors='coerce')
data['S2DQ4C2'] = pandas.to_numeric(data['S2DQ4C2'], errors='coerce')
data['S2DQ1'] = pandas.to_numeric(data['S2DQ1'], errors='coerce')
data['S2DQ2'] = pandas.to_numeric(data['S2DQ2'], errors='coerce')
data['SEX'] = pandas.to_numeric(data['SEX'], errors='coerce')

# Subset data to respondents who started drinking between ages 10 and 30
sub1 = data[(data['S2AQ16A'] >= 10) & (data['S2AQ16A'] <= 30)]
# Copy the new DataFrame
sub2 = sub1.copy()

# Recoding missing data
sub2['S2AQ16A'] = sub2['S2AQ16A'].replace(99, numpy.nan)
sub2['S2DQ3C1'] = sub2['S2DQ3C1'].replace(99, numpy.nan)
sub2['S2DQ3C2'] = sub2['S2DQ3C2'].replace(9, numpy.nan)
sub2['S2DQ4C1'] = sub2['S2DQ4C1'].replace(99, numpy.nan)
sub2['S2DQ4C2'] = sub2['S2DQ4C2'].replace(9, numpy.nan)
sub2['S2DQ1'] = sub2['S2DQ1'].replace(9, numpy.nan)
sub2['S2DQ2'] = sub2['S2DQ2'].replace(9, numpy.nan)

# Creating a secondary variable for the number of siblings who drink
sub2['SIBNO'] = sub2['S2DQ3C1'] + sub2['S2DQ4C1']

# Defining a new variable for sibling drinking status by combining brothers' and sisters' data
def SIBSTS(row):
    if any([row['S2DQ3C2'] == 1, row['S2DQ4C2'] == 1]):
        return 1
    elif all([row['S2DQ3C2'] == 2, row['S2DQ4C2'] == 2]):
        return 0
    else:
        return numpy.nan
sub2['SIBSTS'] = sub2.apply(lambda row: SIBSTS(row), axis=1)

# Defining a new variable for parents' drinking status
def PRSTS(row):
    if any([row['S2DQ1'] == 1, row['S2DQ2'] == 1]):
        return 1
    elif all([row['S2DQ1'] == 2, row['S2DQ2'] == 2]):
        return 0
    else:
        return numpy.nan
sub2['PRSTS'] = sub2.apply(lambda row: PRSTS(row), axis=1)

# Recoding values for 'CONSUMER' into a new variable, DRSTS
recode1 = {1: 1, 2: 0, 3: 0}
sub2['DRSTS'] = sub2['CONSUMER'].map(recode1)

# Recoding new values for the SEX variable
recode2 = {1: 1, 2: 0}
sub2['GEN'] = sub2['SEX'].map(recode2)

data_clean = sub2.dropna()

""" Modeling and Prediction """
# Split into predictors and target
pred = data_clean[['SIBNO', 'SIBSTS', 'PRSTS', 'GEN']]
target = data_clean['S2AQ16A']

predictors = pred.copy()
from sklearn import preprocessing
predictors['SIBNO'] = preprocessing.scale(predictors['SIBNO'].astype('float64'))
predictors['SIBSTS'] = preprocessing.scale(predictors['SIBSTS'].astype('float64'))
predictors['PRSTS'] = preprocessing.scale(predictors['PRSTS'].astype('float64'))
predictors['GEN'] = preprocessing.scale(predictors['GEN'].astype('float64'))

pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.3, random_state=123)

# Specify the lasso regression model
model = LassoLarsCV(cv=10, precompute=False).fit(pred_train, tar_train)

# Print variable names and regression coefficients
print(dict(zip(predictors.columns, model.coef_)))

# Plot coefficient progression
m_log_alphas = -numpy.log10(model.alphas_)
ax = plt.gca()
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-numpy.log10(model.alpha_), linestyle='--', color='k', label='alpha CV')
plt.ylabel('Regression Coefficients')
plt.xlabel('-log(alpha)')
plt.title('Regression Coefficients Progression for Lasso Paths')

# Plot mean squared error for each fold
m_log_alphascv = -numpy.log10(model.cv_alphas_)
plt.figure()
plt.plot(m_log_alphascv, model.cv_mse_path_, ':')
plt.plot(m_log_alphascv, model.cv_mse_path_.mean(axis=-1), 'k', label='Average across the folds', linewidth=2)
plt.axvline(-numpy.log10(model.alpha_), linestyle='--', color='k', label='alpha CV')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean squared error')
plt.title('Mean squared error on each fold')

# MSE from training and test data
from sklearn.metrics import mean_squared_error
train_error = mean_squared_error(tar_train, model.predict(pred_train))
test_error = mean_squared_error(tar_test, model.predict(pred_test))
print('training data MSE')
print(train_error)
print('test data MSE')
print(test_error)

# R-square from training and test data
rsquared_train = model.score(pred_train, tar_train)
rsquared_test = model.score(pred_test, tar_test)
print('training data R-square')
print(rsquared_train)
print('test data R-square')
print(rsquared_test)
OUTPUT :
Regression coefficients :
GEN    : -0.64438534614104892
PRSTS  : -0.38464097596963004
SIBNO  : -0.057518355093678704
SIBSTS : -0.10840676158247332

training data MSE : 11.1956810873
test data MSE : 11.5041705471
training data R-square : 0.0485507314614
test data R-square : 0.0454230302836
Conclusion :
A lasso regression analysis was conducted to identify the best subset of predictors out of a total of four. The target variable is a quantitative variable describing age at onset of alcohol consumption. The predictor variables are SIBSTS (siblings' drinking status), PRSTS (parents' drinking status) and GEN (gender), which are all categorical, and one quantitative variable, SIBNO (number of siblings who drink).
All predictor variables were standardised to have a mean of zero and a standard deviation of one.
The data were split into 70% training data and 30% test data. The least angle regression algorithm with k=10-fold cross-validation was used to estimate the lasso regression model on the training set, and the model was validated using the test set. The change in the cross-validation mean squared error at each step was used to identify the best subset of predictor variables.
After running the lasso regression, all predictor variables turned out to have negative coefficients, and none of them was shrunk to zero. GEN has the largest coefficient magnitude, while SIBNO has the smallest.
The model is not good, as it explains only about 4% of the variability in the response.
Machine Learning For Data Analysis Week 2 : Random Forest
Code :
import numpy
import sklearn.metrics
import matplotlib.pylab as plt
from sklearn.cross_validation import train_test_split

# data_clean is the cleaned NESARC subset built in the data management posts
predictors = data_clean[['S2AQ16A', 'SIBNO', 'SIBSTS', 'PRSTS', 'GEN']]

targets = data_clean['DRSTS'].astype('category')

pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)

print(pred_train.shape)
print(pred_test.shape)
print(tar_train.shape)
print(tar_test.shape)
Output :
(17902, 5)
(11935, 5)
(17902,)
(11935,)
Code :
# Build model on training data
from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(n_estimators=25)
classifier = classifier.fit(pred_train, tar_train)

predictions = classifier.predict(pred_test)

print(sklearn.metrics.confusion_matrix(tar_test, predictions))
print(sklearn.metrics.accuracy_score(tar_test, predictions))
Output :
[[  36 2489]
 [  95 9315]]
0.783493925429
Code :
# Fit an extra trees model to the data to rank the predictors
from sklearn.ensemble import ExtraTreesClassifier

model = ExtraTreesClassifier()
model.fit(pred_train, tar_train)
# Display the relative importance of each attribute
print(model.feature_importances_)
Output :
[ 0.54571632 0.24291864 0.05134129 0.05579401 0.10422973]
Code :
""" Running a different number of trees and see the effect of that on the accuracy of the prediction """
trees=range(25) accuracy=numpy.zeros(25)
for idx in range(len(trees)): classifier=RandomForestClassifier(n_estimators=idx + 1) classifier=classifier.fit(pred_train,tar_train) predictions=classifier.predict(pred_test) accuracy[idx]=sklearn.metrics.accuracy_score(tar_test, predictions)
plt.cla() plt.plot(trees, accuracy)
Output :
Conclusion :
The aim is to find the important explanatory variables for predicting the binary response variable 'DRSTS' (drinking status).
The explanatory variables with the highest relative importance scores were 'S2AQ16A' (age at onset of alcohol consumption), 'SIBNO' (number of siblings who drink) and 'GEN' (sex). The accuracy of the random forest was 78%, with the subsequent growing of multiple trees, rather than a single tree, adding little to the overall accuracy of the model and suggesting that interpretation of a single decision tree may be appropriate.
Machine Learning For Data Analysis Week 1 : Decision Tree
My target variable, 'DRSTS', is a categorical variable that describes drinking status.
The predictor variables are:
SIBSTS : Sibling drinking status
PRSTS : Parents drinking status
GEN : Gender
SIBNO : No. of siblings who drink
S2AQ16A : Age at onset of drinking
After running the decision tree classifier I got an accuracy of 1.0 and only one element in the confusion matrix. I also got only one node in the decision tree classifier image.
Please help me find where I am making a mistake. Code and output are below.
Code :
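The code image did not survive; below is a minimal sketch of a typical decision tree fit for this setup (a hedged reconstruction assuming the same data_clean frame and variables as in the other posts, not necessarily the author's exact code):

from sklearn.tree import DecisionTreeClassifier
from sklearn.cross_validation import train_test_split
import sklearn.metrics

# Assumed reconstruction: data_clean comes from the data management posts
predictors = data_clean[['S2AQ16A', 'SIBNO', 'SIBSTS', 'PRSTS', 'GEN']]
targets = data_clean['DRSTS']

pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)

classifier = DecisionTreeClassifier()
classifier = classifier.fit(pred_train, tar_train)
predictions = classifier.predict(pred_test)

print(sklearn.metrics.confusion_matrix(tar_test, predictions))
print(sklearn.metrics.accuracy_score(tar_test, predictions))

Incidentally, a 1x1 confusion matrix together with an accuracy of 1.0 usually means the target column contains only a single class, so printing targets.value_counts() before fitting is a good first check.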
Output :
Shape of training and test data :
(17902, 5)
(11935, 5)
(17902L,)
(11935L,)
Confusion matrix and classification accuracy :
[[11935]]
1.0
Decision tree graph :
Regression Modelling in Practice Week 4 - Logistic Regression : Test a Logistic Regression Model
Output :
                           Logit Regression Results
==============================================================================
Dep. Variable:                  DRSTS   No. Observations:                38545
Model:                          Logit   Df Residuals:                    38541
Method:                           MLE   Df Model:                            3
Date:                Wed, 21 Dec 2016   Pseudo R-squ.:                 0.04578
Time:                        00:29:28   Log-Likelihood:                -18260.
converged:                       True   LL-Null:                       -19136.
                                        LLR p-value:                     0.000
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept      0.8667      0.017     49.531      0.000         0.832     0.901
SIBSTS         0.2450      0.037      6.546      0.000         0.172     0.318
PRSTS          0.7872      0.038     20.948      0.000         0.714     0.861
GEN            0.9478      0.029     33.252      0.000         0.892     1.004
==============================================================================

Odds ratios with 95% confidence intervals :
           Lower CI  Upper CI        OR
Intercept  2.298729  2.461926  2.378928
SIBSTS     1.187283  1.374926  1.277664
PRSTS      2.041221  2.365176  2.197236
GEN        2.439810  2.728227  2.579991
SIBSTS : sibling drinking status
DRSTS : drinking(alcohol) status
PRSTS : parent drinking status
GEN : gender
The aim is to examine the association between alcohol drinking status and siblings' drinking status.
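The post shows only the output; a minimal statsmodels sketch that would produce output of this shape (a hedged reconstruction using the data_clean frame from the other posts, not necessarily the author's exact code):

import numpy
import statsmodels.formula.api as smf

# Assumed reconstruction: logistic regression of drinking status on
# siblings' drinking status, parents' drinking status and gender
lreg1 = smf.logit(formula='DRSTS ~ SIBSTS + PRSTS + GEN', data=data_clean).fit()
print(lreg1.summary())

# Odds ratios with 95% confidence intervals
conf = lreg1.conf_int()
conf['OR'] = lreg1.params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print(numpy.exp(conf))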
Summary :
The association is verified as positive: there is a clear association between DRSTS and SIBSTS after controlling for the other two variables, the p-value is less than 0.05, and the coefficient is positive.
No confounders were found.
The odds of alcohol consumption are about 1.28 times higher for people whose siblings drink than for those whose siblings do not (OR = 1.277, 95% CI = 1.19-1.37, p < .0001).
The odds of alcohol consumption are about 2.2 times higher for people whose parents drink than for those whose parents do not (OR = 2.2, 95% CI = 2.04-2.36, p < .0001).
The odds of alcohol consumption are about 2.58 times higher for males than for females (OR = 2.58, 95% CI = 2.44-2.73, p < .0001).
Regression Modelling in Practice Week 3 - Multiple Regression : Test a Multiple Regression Model
My aim is to find the association between siblings' drinking status and the number of siblings who drink, and a person's age at onset of drinking.
Explanatory variable :
SIBNO - Number of siblings who drink
SIBSTS - drinking status of siblings
PRSTS- drinking status of parent
GEN - Person’s gender
Taking GEN and PRSTS as possible confounders.
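The code images are missing; a minimal sketch of the model and the diagnostic plots discussed below (a hedged reconstruction assuming the data_clean frame from the other posts):

import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Assumed reconstruction of the multiple regression
reg1 = smf.ols(formula='S2AQ16A ~ SIBNO + SIBSTS + PRSTS + GEN', data=data_clean).fit()
print(reg1.summary())

# Q-Q plot of the residuals
sm.qqplot(reg1.resid, line='r')

# Standardized residuals, to look for extreme outliers (beyond +/- 3)
stdres = (reg1.resid - reg1.resid.mean()) / reg1.resid.std()
plt.figure()
plt.plot(stdres.values, 'o', ls='None')
plt.axhline(y=0, color='r')
plt.ylabel('Standardized residual')
plt.xlabel('Observation number')

# Regression diagnostic plots for the quantitative predictor SIBNO
fig = plt.figure(figsize=(12, 8))
fig = sm.graphics.plot_regress_exog(reg1, 'SIBNO', fig=fig)

# Leverage plot
fig2 = sm.graphics.influence_plot(reg1, size=8)
plt.show()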
Discussion of the results for the associations between all of the explanatory variables and the response variable :
Summary :
All explanatory variables have negative coefficients, so they have a negative linear relation with the response variable.
Of all the explanatory variables, only SIBNO has an insignificant p-value, although it was significant in the simple linear regression.
So that earlier association was due to confounders.
Regression Diagnostic Plots :
Q-Q Plot:
The model does not fit well at the lower and upper quantiles.
Residuals Plot :
There are many observations that are above 3 or below -3, so there are many extreme outliers.
Regression Plots :
All explanatory variables other than SIBNO are categorical with only two levels, so this plot is drawn only for SIBNO.
The residuals-vs-SIBNO plot makes it clear that the residuals are larger at the lower end and again at the higher end.
The partial regression plot shows that there is a very weak negative linear relation between S2AQ16A and SIBNO.
Leverage Plot :
It shows that there are many outliers.
Regression Modelling in Practice Week 2 - Basics of Linear Regression : Test a Basic Linear Regression Model
Code :
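The code image is missing; a minimal sketch of this basic linear regression (a hedged reconstruction assuming the data_clean frame from the other posts):

import statsmodels.formula.api as smf

# Assumed reconstruction: regress age at onset of drinking on
# siblings' drinking status (two levels: 0 = no, 1 = yes)
reg1 = smf.ols(formula='S2AQ16A ~ C(SIBSTS)', data=data_clean).fit()
print(reg1.summary())

# Mean age at onset by siblings' drinking status
print(data_clean.groupby('SIBSTS')['S2AQ16A'].mean())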
Output :
Conclusion :
I have taken siblings' drinking status (SIBSTS) as my categorical explanatory variable with two levels, and age at onset of alcohol drinking (S2AQ16A) as the quantitative response variable.
The average age at onset of alcohol drinking when SIBSTS is 'yes' is found to be slightly lower than when SIBSTS is 'no'.
The p-value is very low and thus significant, so we can reject the null hypothesis. The intercept is 19.0537 and the slope is -0.5009. There is a negative linear association between these two variables, but it is not strong, as the slope is small.
Regression Modelling in Practice Week 1 - Introduction to Regression Assignment : Writing about your data
Sample:
The sample is from the first wave of the National Epidemiologic Survey on Alcohol and Related Conditions (NESARC), the largest nationwide longitudinal survey of alcohol and drug use and associated psychiatric and medical comorbidities. Participants (N = 43,093) represented the civilian, non-institutionalized adult population of the United States, and included persons living in households, military personnel living off base, and persons residing in the following group quarters: boarding or rooming houses, non-transient hotels and motels, shelters, facilities for housing workers, college quarters, and group homes. The NESARC included oversampling of Blacks, Hispanics and young adults aged 18 to 24 years. The data analytic sample for this study included participants whose age at onset of alcohol consumption was between 10 and 30 years (N = 41,281), as 95% fall in this range.
Procedure:
Data were collected by trained U.S. Census Bureau Field Representatives during 2001–2002 through computer-assisted personal interviews (CAPI). One adult was selected for interview in each household, and interviews were conducted in respondents’ homes following informed consent procedures.
Measures:
Response variable :
Drinking status data is taken from Section 2A: Alcohol Consumption of the NESARC wave 1 dataset. Current drinkers and ex-drinkers are combined into one group coded as '1', and lifetime abstainers are coded as '0'. A subset of ages 10-30 at onset of alcohol consumption is taken, as more than 95% fall in this range.
Explanatory Variable :
From Section 2D: Family History of Alcoholism, brothers' and sisters' drinking statuses are combined into a single siblings' drinking status, and the numbers of brothers and sisters who drink are combined into a single number of siblings who drink.
I want to study the relation between siblings' drinking status and alcohol consumption.
Data Analysis Tools WEEK 4 : Exploring Statistical Interactions
The moderating variable is gender.
ANOVA :
OUTPUT :
Taking in account moderating variable:
OUTPUT:
OUTPUT :
CONCLUSION :
In general, the p-value is significant, but when gender is taken as the moderating variable, for males the mean value for the 'no' case increases slightly and the p-value becomes insignificant, while for females the mean value for the 'no' case decreases slightly and the p-value decreases and becomes even more significant. So we can conclude that for males there is no association between 'SIBSTS' (siblings' drinking status) and 'S2AQ16A' (age at onset of drinking).
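The code images are missing; the usual pattern in this course for testing moderation is to re-run the test within each level of the moderator, roughly as in this hedged sketch (assuming the data_clean frame and recoded variables from the other posts):

import statsmodels.formula.api as smf

# Assumed reconstruction: run the ANOVA separately within each level
# of the moderator GEN (1 = male, 0 = female per the earlier recode)
sub_male = data_clean[data_clean['GEN'] == 1]
sub_female = data_clean[data_clean['GEN'] == 0]

for label, sub in [('males', sub_male), ('females', sub_female)]:
    model = smf.ols(formula='S2AQ16A ~ C(SIBSTS)', data=sub).fit()
    print('ANOVA for', label)
    print(model.summary())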
CHI SQUARE :
OUTPUT :
Taking in account moderating variable:
OUTPUT :
OUTPUT :
CONCLUSION :
In all three cases the p-value is significant, but for males the chance of being a drinker if one's siblings drink increases, while for females it decreases.
PEARSON CORRELATION :
OUTPUT :
Taking in account moderating variable:
OUTPUT :
OUTPUT :
CONCLUSION :
All three show a linear relation: negative for males, positive for females, and negative in general. So the relation changes direction for females. However, the relation is only significant for males.
Data Analysis Tools WEEK 3 : Pearson Correlation Coefficient
Code :
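The code image is missing; a minimal sketch of the correlation test (a hedged reconstruction assuming the data_clean frame from the other posts):

import scipy.stats

# Assumed reconstruction: correlate the number of siblings who drink
# with age at onset of drinking
r, p = scipy.stats.pearsonr(data_clean['SIBNO'], data_clean['S2AQ16A'])
print('correlation coefficient :', r)
print('p-value :', p)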
Output :
Conclusion :
The correlation coefficient is negative and has a very small value, so there is a very weak negative linear relation: age at onset of drinking decreases slightly as the number of siblings who drink increases. The p-value is very small, which suggests that this linear relation is not due to sampling variability.
Data Analysis Tools WEEK 2 : Chi-Square Test of Independence
Code :
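The code image is missing; a minimal sketch of the chi-square test (a hedged reconstruction assuming the data_clean frame from the other posts):

import pandas
import scipy.stats

# Assumed reconstruction: contingency table of drinking status by
# siblings' drinking status, then the chi-square test of independence
ct1 = pandas.crosstab(data_clean['DRSTS'], data_clean['SIBSTS'])
print(ct1)

chi2, p, dof, expected = scipy.stats.chi2_contingency(ct1)
print('chi-square value :', chi2)
print('p-value :', p)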
Output :
Observation & Discussion :
First I ran a chi-square test with a categorical explanatory variable, SIBSTS (siblings' drinking status), which has only two categories (0 = no, 1 = yes), and the response variable DRSTS (drinking status: 0 = no, 1 = yes). I found p-value << 0.05, so we can reject the null hypothesis and conclude that drinking status depends on siblings' drinking status.
Next I ran a chi-square test with a categorical explanatory variable, SIBNO (number of siblings who drink), categorised into three categories: (0,4), (4,8) and (8,12). I got p-value > 0.05, so we cannot reject the null hypothesis, and we conclude that the number of siblings who drink does not affect drinking status. As the p-value is > 0.05, we do not need a post hoc test: a post hoc test is done only after getting a significant p-value, to check the pairwise associations.
Data Analysis Tools Week 1 : Hypothesis Testing and ANOVA - Assignment
Code :
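The code image is missing; a minimal sketch of the ANOVA and post hoc test (a hedged reconstruction assuming the data_clean frame from the other posts; ETHRACE2A as the NESARC ethnicity variable is my assumption, since the post does not name it):

import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

# Assumed reconstruction: ANOVA of age at onset of drinking by
# siblings' drinking status
model1 = smf.ols(formula='S2AQ16A ~ C(SIBSTS)', data=data_clean).fit()
print(model1.summary())

# Second case: an explanatory variable with more than 2 categories
# (assuming ethnicity is the NESARC variable ETHRACE2A)
model2 = smf.ols(formula='S2AQ16A ~ C(ETHRACE2A)', data=data_clean).fit()
print(model2.summary())

# Post hoc test: Tukey HSD pairwise comparisons
mc1 = multi.MultiComparison(data_clean['S2AQ16A'], data_clean['ETHRACE2A'])
print(mc1.tukeyhsd().summary())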
Output :
Conclusion and Discussion :
First I computed the p-value for the variables 'S2AQ16A' (age when started drinking) and 'SIBSTS' (siblings' drinking status). I found the p-value to be very low (p-value << 5%), so I reject the null hypothesis. Respondents whose siblings drink were found to have a slightly lower mean age when they started drinking.
In the second case I took a categorical explanatory variable with more than 2 categories, and again found a very low p-value (p-value << 5%), so I reject the null hypothesis and conclude that ethnicity has an effect on the age when people started drinking. From the post hoc test (Tukey HSD) I found that only groups 2 and 4 have the same mean age.
Data Management and Visualisation Week - 4 Assignment : Data Management
Code :
Code only for plotting graphs
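The code images are missing; a minimal seaborn sketch of the kinds of plots described below (a hedged reconstruction assuming the data_clean frame and recoded variables from the other posts; factorplot was the function name in the seaborn versions of this era, renamed catplot in newer versions):

import seaborn
import matplotlib.pyplot as plt

# Uni-variate: counts of drinking status and of siblings' drinking status
seaborn.countplot(x='DRSTS', data=data_clean)
plt.xlabel('Drinking status')

plt.figure()
seaborn.countplot(x='SIBSTS', data=data_clean)
plt.xlabel('Siblings drinking status')

# Bi-variate: proportion of drinkers by siblings' drinking status
seaborn.factorplot(x='SIBSTS', y='DRSTS', data=data_clean, kind='bar', ci=None)
plt.xlabel('Siblings drinking status')
plt.ylabel('Proportion of drinkers')
plt.show()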
Output And Summary:
First, the uni-variate analysis:
The number of current drinkers is huge in comparison to the other groups.
The frequency of people whose siblings drink is only approximately 7,500, while the frequency of people whose siblings do not drink is approximately 33,000.
9 corresponds to people who do not drink. If we leave out '9', this is a uni-modal graph (age 18 has the highest frequency, approximately 7,100), skewed on both sides.
Zero siblings has the highest frequency, so we can conclude that the majority of people's siblings don't drink.
Bi-variate Analysis :
People whose siblings drink have a higher proportion of alcohol consumption. This shows that people whose siblings drink have a greater tendency to be drinkers, which is exactly my hypothesis.
This graph shows that as the number of siblings who drink increases, the proportion of alcohol consumption also increases, i.e. a person's chance of being a drinker increases.
I cannot draw any conclusion from this scatter plot.
Data Management and Visualisation Week -3 Assignment : Data Management
Code :-
Output :-
Discussion :-
Subset the Data :
During the second assignment I found that more than 95% of the data for the variable S2AQ16A (AGE WHEN STARTED DRINKING, NOT COUNTING SMALL TASTES OR SIPS) is distributed over ages 10-30, so I decided to subset my data set to ages 10-30.
Recoding Missing Data :
Did for these variables :
S2AQ16A AGE WHEN STARTED DRINKING, NOT COUNTING SMALL TASTES OR SIP
99. Unknown - set ‘99’ to ‘nan’
BL. NA, lifetime abstainer - coded as valid data
S2DQ3C1 NUMBER OF FULL BROTHERS WHO WERE EVER ALCOHOLICS OR PROBLEM DRINKERS
99. Unknown - set ‘99’ to ‘nan’
S2DQ3C2 ANY FULL BROTHERS EVER ALCOHOLICS OR PROBLEM DRINKERS
9. Unknown - set ‘9′ to ‘nan’
S2DQ4C1 NUMBER OF FULL SISTERS WHO WERE EVER ALCOHOLICS OR PROBLEM DRINKERS
99. Unknown - set ‘99’ to ‘nan’
S2DQ4C2 ANY FULL SISTERS EVER ALCOHOLICS OR PROBLEM DRINKERS
9. Unknown - set ‘9′ to ‘nan’
Coding Valid Data:
CONSUMER DRINKING STATUS
26946 1. Current drinker
7881 2. Ex-drinker
8266 3. Lifetime Abstainer
S2AQ16A AGE WHEN STARTED DRINKING, NOT COUNTING SMALL TASTES OR SIPS
33891 5-83. Age
936 99. Unknown
8266 BL. NA, lifetime abstainer
In the variable S2AQ16A, 'BL' is valid data, as it corresponds to option '3' in the 'CONSUMER' variable.
Creating Secondary Variable:
As my question is focused on siblings, I added the numbers of brothers (S2DQ3C1 - NUMBER OF FULL BROTHERS WHO WERE EVER ALCOHOLICS OR PROBLEM DRINKERS) and sisters (S2DQ4C1 - NUMBER OF FULL SISTERS WHO WERE EVER ALCOHOLICS OR PROBLEM DRINKERS) to get the number of siblings, and recorded it in another variable, SIBNO.
As I am more concerned with siblings' drinking status, I defined another variable for siblings' drinking status using the variables S2DQ3C2 (ANY FULL BROTHERS EVER ALCOHOLICS OR PROBLEM DRINKERS) and S2DQ4C2 (ANY FULL SISTERS EVER ALCOHOLICS OR PROBLEM DRINKERS).
Data Management and Visualisation Week -2 Assignment : Running Your First Program
Code :
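The code image is missing; a minimal sketch of the frequency distributions reported below (variable names are from the NESARC codebook):

import pandas

data = pandas.read_csv('nesarc_pds.csv', low_memory=False)

# Percentage distribution for each variable of interest
for var in ['CONSUMER', 'S2AQ16A', 'S2DQ3C1', 'S2DQ3C2', 'S2DQ4C1', 'S2DQ4C2']:
    print('percentages for', var)
    print(data[var].value_counts(sort=False, normalize=True) * 100)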
Output :
Summary :
Drinking status :-
Current Drinker : 62.53%
Ex Drinker : 18.29 %
Age When Started Drinking :-
Ages 14-22 account for 63.01%.
Age 18 has the maximum percentage, 16.34%, followed by age 21 with 10.37%.
Number Of Full Brothers Who Were Ever Alcoholics Or Problem Drinkers :-
0 : 82.21%
1 : 10.23%
Any Full Brothers Ever Alcoholics Or Problem Drinkers :-
Yes : 14.41%
No : 83.0%
Number Of Full Sisters Who Were Ever Alcoholics Or Problem Drinkers :-
0 : 90.71%
1 : 4.28%
Any Full Sisters Ever Alcoholics Or Problem Drinkers :-
Yes : 6.28%
No : 91.38%
There were missing data.