maniparas-blog
Data Analysis and Interpretation
This blog is dedicated to Coursera's Data Analysis and Interpretation course by Wesleyan University.
maniparas-blog · 8 years ago
Capstone Project Assignment 2 : Methods
Sample :
The sample consists of N = 891 passenger records from the Titanic.
Measures :
There are eleven variables (columns) in total in the Titanic dataset.
Survived : 0 = No and 1 = Yes
pclass : Passenger class, a proxy for socio-economic status (SES): 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower
name : Name of passenger
sex : Gender of passenger
age : Age of passenger in years; fractional if age is less than one (1)
sibsp : Number of siblings/spouses aboard
parch : Number of parents/children aboard
ticket : Ticket number
fare : Passenger fare
cabin : Cabin of ship
embarked : Port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
Of the eleven variables, 'Survived' is the target variable. As explanatory variables we choose only pclass, sex, age, embarked, sibsp and parch.
Passenger class is a proxy for socio-economic status. Upper-class passengers, being richer or more powerful, would have had easier access to lifeboats and hence a better chance of survival than the other classes.
Female passengers were more likely than male passengers to get a place in a lifeboat, and so more likely to survive. Similarly, children were more likely to be put into lifeboats, while older people, who had already lived most of their lives, were less likely to board them.
The importance of the embarked, parch and sibsp variables will be visualized in the next section.
Including 'name' as an explanatory variable does not make sense, because we cannot infer a person's survival from their name, apart from a few particular names of celebrities, politicians or wealthy businessmen. Similarly, including 'ticket' does not make sense, as most ticket numbers are unique and there are also many missing values, which are very difficult to fill in.
Cabin could be included, but it has many missing values, and filling them in is very difficult.
At first glance 'fare' looks like it should be included, but a closer analysis shows that including it would result in a bad model, because 'fare' is confounded with 'age', 'embarked', 'pclass' and maybe 'sex'. A passenger's fare depends on age (children have a lower fare), port of embarkation (the longer you travel, the more you pay) and passenger class (the higher the class, the higher the price), and it is also possible that females were given some discounts. So 'fare' mostly carries information from other variables that we have already included.
Some values of 'age' are missing; these are set to the median age. Similarly, missing values of 'embarked' are set to 'S', as 'S' has the highest frequency. The quantitative variable age is binned into five categories (CHILDREN, ADOLESCENTS, ADULTS, MIDDLE AGE, OLD) for better visualization.
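The imputation and binning steps described above could be sketched roughly as follows. This is only an illustration: the tiny data frame is made up, and the age cut points in `age_bins` are assumptions, since the post does not state the boundaries it used.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the Titanic data (illustrative values only)
df = pd.DataFrame({
    "age": [22.0, np.nan, 4.0, 58.0, np.nan, 35.0, 71.0, 15.0],
    "embarked": ["S", "C", None, "S", "Q", "S", None, "C"],
})

# Fill missing ages with the median of the observed ages
df["age"] = df["age"].fillna(df["age"].median())

# Fill missing ports with the most frequent port
df["embarked"] = df["embarked"].fillna(df["embarked"].mode()[0])

# Bin age into five ordered groups; these cut points are assumed, not taken from the post
age_bins = [0, 12, 18, 40, 60, np.inf]
age_labels = ["CHILDREN", "ADOLESCENTS", "ADULTS", "MIDDLE AGE", "OLD"]
df["age_group"] = pd.cut(df["age"], bins=age_bins, labels=age_labels)

print(df)
```

Using the median rather than the mean keeps the filled value robust to a few very old or very young passengers.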
Analysis :
The distributions for the predictors were evaluated by examining frequency tables for categorical variables and calculating the mean, standard deviation and minimum and maximum values for quantitative variables. Scatter plots were also examined, and Pearson Correlation was used to test bivariate associations between individual predictors and the response variable.
A random forest algorithm was used to predict the survival of passengers. The random forest model was estimated on a training data set consisting of a random 70% of the sample. A test data set contained the other 30% of the sample. A confusion matrix was used to evaluate the model.
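A minimal sketch of this workflow, assuming scikit-learn and using synthetic stand-in data rather than the Titanic file. The column meanings, the labelling rule and the choice of 100 trees are illustrative assumptions, not the settings of the actual project.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500

# Synthetic predictors standing in for pclass (1-3), sex (0/1) and age
X = np.column_stack([
    rng.integers(1, 4, n),
    rng.integers(0, 2, n),
    rng.uniform(1, 80, n),
])

# Made-up survival rule with 10% label noise, only so the model has something to learn
y = ((X[:, 1] == 1) | (X[:, 0] == 1)).astype(int)
flip = rng.random(n) < 0.1
y = np.where(flip, 1 - y, y)

# 70% training data, 30% test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, pred)
acc = accuracy_score(y_test, pred)
print(cm)
print(acc)
```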
maniparas-blog · 8 years ago
Capstone Project Assignment 1
Research Question: 
Analysis and prediction of survival on the Titanic.
The purpose of this study is first to analyze and then to predict survival on the Titanic. The analysis will look at how different factors are related to the survival of passengers on the Titanic. Predictions will then be made based on those factors.
Data for this research is taken from the Kaggle Titanic competition.
maniparas-blog · 8 years ago
Machine Learning For Data Analysis  Week 4 : K Means Clustering
Code:
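from pandas import Series, DataFrame
import pandas
import numpy
import matplotlib.pylab as plt
# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split lives in model_selection
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans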
data = pandas.read_csv('nesarc_pds.csv', low_memory=False)
# Set variables to numeric
data['CONSUMER'] = pandas.to_numeric(data['CONSUMER'], errors='coerce')
data['S2AQ16A'] = pandas.to_numeric(data['S2AQ16A'], errors='coerce')
data['S2DQ3C1'] = pandas.to_numeric(data['S2DQ3C1'], errors='coerce')
data['S2DQ3C2'] = pandas.to_numeric(data['S2DQ3C2'], errors='coerce')
data['S2DQ4C1'] = pandas.to_numeric(data['S2DQ4C1'], errors='coerce')
data['S2DQ4C2'] = pandas.to_numeric(data['S2DQ4C2'], errors='coerce')
data['S2DQ1'] = pandas.to_numeric(data['S2DQ1'], errors='coerce')
data['S2DQ2'] = pandas.to_numeric(data['S2DQ2'], errors='coerce')
data['SEX'] = pandas.to_numeric(data['SEX'], errors='coerce')

# Subset data to respondents who started drinking between ages 10 and 30
sub1 = data[((data['S2AQ16A'] >= 10) & (data['S2AQ16A'] <= 30))]
# Copy to a new DataFrame
sub2 = sub1.copy()

# Recode missing-data codes (9 and 99) to NaN
sub2['S2AQ16A'] = sub2['S2AQ16A'].replace(99, numpy.nan)
sub2['S2DQ3C1'] = sub2['S2DQ3C1'].replace(99, numpy.nan)
sub2['S2DQ3C2'] = sub2['S2DQ3C2'].replace(9, numpy.nan)
sub2['S2DQ4C1'] = sub2['S2DQ4C1'].replace(99, numpy.nan)
sub2['S2DQ4C2'] = sub2['S2DQ4C2'].replace(9, numpy.nan)
sub2['S2DQ1'] = sub2['S2DQ1'].replace(9, numpy.nan)
sub2['S2DQ2'] = sub2['S2DQ2'].replace(9, numpy.nan)

# Secondary variable: number of siblings who drink (brothers + sisters)
sub2['SIBNO'] = sub2['S2DQ3C1'] + sub2['S2DQ4C1']

# Sibling drinking status, combining data for brothers and sisters
def SIBSTS(row):
    if any([row['S2DQ3C2'] == 1, row['S2DQ4C2'] == 1]):
        return 1
    elif all([row['S2DQ3C2'] == 2, row['S2DQ4C2'] == 2]):
        return 0
    else:
        return numpy.nan

sub2['SIBSTS'] = sub2.apply(lambda row: SIBSTS(row), axis=1)

# Parent drinking status, combining data for father and mother
def PRSTS(row):
    if any([row['S2DQ1'] == 1, row['S2DQ2'] == 1]):
        return 1
    elif all([row['S2DQ1'] == 2, row['S2DQ2'] == 2]):
        return 0
    else:
        return numpy.nan

sub2['PRSTS'] = sub2.apply(lambda row: PRSTS(row), axis=1)

# Recode 'CONSUMER' into a new variable, DRSTS
recode1 = {1: 1, 2: 0, 3: 0}
sub2['DRSTS'] = sub2['CONSUMER'].map(recode1)

# Recode SEX into a new variable, GEN
recode2 = {1: 1, 2: 0}
sub2['GEN'] = sub2['SEX'].map(recode2)
data_clean = sub2.dropna()

"""
Modeling and Prediction
"""
# Select and standardize the clustering variables, then split into training and testing sets
pred = data_clean[['SIBNO','SIBSTS','PRSTS','GEN','S2AQ16A']]

clustervar = pred.copy()
from sklearn import preprocessing

clustervar['SIBNO'] = preprocessing.scale(clustervar['SIBNO'].astype('float64'))
clustervar['SIBSTS'] = preprocessing.scale(clustervar['SIBSTS'].astype('float64'))
clustervar['PRSTS'] = preprocessing.scale(clustervar['PRSTS'].astype('float64'))
clustervar['GEN'] = preprocessing.scale(clustervar['GEN'].astype('float64'))
clustervar['S2AQ16A'] = preprocessing.scale(clustervar['S2AQ16A'].astype('float64'))

clus_train, clus_test = train_test_split(clustervar, test_size=.3, random_state=123)

# k-means cluster analysis for 1-9 clusters
from scipy.spatial.distance import cdist
clusters = range(1, 10)
meandist = []

for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(clus_train)
    clusassign = model.predict(clus_train)
    # average distance of each observation to its nearest cluster centroid
    meandist.append(sum(numpy.min(cdist(clus_train, model.cluster_centers_, 'euclidean'), axis=1))
                    / clus_train.shape[0])
"""
Plot average distance from observations from the cluster centroid
to use the Elbow Method to identify number of clusters to choose
"""
plt.plot(clusters, meandist)
plt.xlabel('Number of clusters')
plt.ylabel('Average distance')
plt.title('Selecting k with the Elbow Method')

# Interpret 4 cluster solution
model3 = KMeans(n_clusters=4)
model3.fit(clus_train)
clusassign = model3.predict(clus_train)

# plot clusters on the first two principal components
from sklearn.decomposition import PCA
pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(clus_train)
plt.scatter(x=plot_columns[:,0], y=plot_columns[:,1], c=model3.labels_)
plt.xlabel('Canonical variable 1')
plt.ylabel('Canonical variable 2')
plt.title('Scatterplot of Canonical Variables for 4 Clusters')
plt.show()
"""
BEGIN multiple steps to merge cluster assignment with clustering variables to examine
cluster variable means by cluster
"""
# create a unique identifier variable from the index for the
# cluster training data to merge with the cluster assignment variable
clus_train.reset_index(level=0, inplace=True)
# create a list that has the new index variable
cluslist = list(clus_train['index'])
# create a list of cluster assignments
labels = list(model3.labels_)
# combine index variable list with cluster assignment list into a dictionary
newlist = dict(zip(cluslist, labels))
newlist
# convert newlist dictionary to a dataframe
newclus = DataFrame.from_dict(newlist, orient='index')
newclus
# rename the cluster assignment column
newclus.columns = ['cluster']

# now do the same for the cluster assignment variable
# create a unique identifier variable from the index for the
# cluster assignment dataframe
# to merge with cluster training data
newclus.reset_index(level=0, inplace=True)
# merge the cluster assignment dataframe with the cluster training variable dataframe
# by the index variable
merged_train = pandas.merge(clus_train, newclus, on='index')
merged_train.head(n=100)
# cluster frequencies
merged_train.cluster.value_counts()

"""
END multiple steps to merge cluster assignment with clustering variables to examine
cluster variable means by cluster
"""

# FINALLY calculate clustering variable means by cluster
clustergrp = merged_train.groupby('cluster').mean()
print("Clustering variable means by cluster")
print(clustergrp)
# validate clusters in training data by examining cluster differences in DRSTS using ANOVA
# first have to merge DRSTS with clustering variables and cluster assignment data
drsts_data = data_clean['DRSTS']
# split DRSTS data into train and test sets (same seed as the clustering split)
drsts_train, drsts_test = train_test_split(drsts_data, test_size=.3, random_state=123)
drsts_train1 = pandas.DataFrame(drsts_train)
drsts_train1.reset_index(level=0, inplace=True)
merged_train_all = pandas.merge(drsts_train1, merged_train, on='index')
sub1 = merged_train_all[['DRSTS', 'cluster']].dropna()

import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

drstsmod = smf.ols(formula='DRSTS ~ C(cluster)', data=sub1).fit()
print(drstsmod.summary())

print('means for DRSTS by cluster')
m1 = sub1.groupby('cluster').mean()
print(m1)

print('standard deviations for DRSTS by cluster')
m2 = sub1.groupby('cluster').std()
print(m2)

# post hoc pairwise comparisons between clusters
mc1 = multi.MultiComparison(sub1['DRSTS'], sub1['cluster'])
res1 = mc1.tukeyhsd()
print(res1.summary())
Output :
A k-means cluster analysis was performed to identify underlying subgroups of respondents based on the similarity of their responses on 5 variables that could have an impact on an individual's drinking status. The clustering variables were three binary variables, PRSTS (parents' drinking status), SIBSTS (siblings' drinking status) and GEN (gender), and two quantitative variables, SIBNO (number of siblings who drink) and S2AQ16A (age at onset of drinking).
Tumblr media
The elbow plot suggests possible solutions at 3, 4, 6 and 8 clusters; I chose the 4-cluster solution.
Tumblr media
As the figure shows, the clusters are not well separated.
Clustering variable means by cluster
                index      SIBNO     SIBSTS      PRSTS        GEN    S2AQ16A
cluster
0        21945.028524  -0.388644  -0.494159  -0.571713  -0.957500   0.255108
1        21747.304583  -0.388644  -0.494159  -0.571713   1.044387  -0.109161
2        21379.038997  -0.388644  -0.494159   1.749130  -0.106187  -0.162766
3        20697.274571   1.578009   2.023639   0.516130  -0.095059  -0.135573
                            OLS Regression Results
==============================================================================
Dep. Variable:                  DRSTS   R-squared:                       0.008
Model:                            OLS   Adj. R-squared:                  0.008
Method:                 Least Squares   F-statistic:                     57.62
Date:                Wed, 01 Feb 2017   Prob (F-statistic):           4.36e-37
Time:                        19:32:35   Log-Likelihood:                -10854.
No. Observations:               20885   AIC:                         2.172e+04
Df Residuals:                   20881   BIC:                         2.175e+04
Df Model:                           3
Covariance Type:            nonrobust
===================================================================================
                      coef    std err          t      P>|t|      [95.0% Conf. Int.]
-----------------------------------------------------------------------------------
Intercept           0.7672      0.005    153.863      0.000         0.757     0.777
C(cluster)[T.1]     0.0644      0.007      9.202      0.000         0.051     0.078
C(cluster)[T.2]     0.0403      0.009      4.624      0.000         0.023     0.057
C(cluster)[T.3]    -0.0318      0.008     -3.952      0.000        -0.048    -0.016
==============================================================================
Omnibus:                     3951.048   Durbin-Watson:                   2.013
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             6758.200
Skew:                          -1.393   Prob(JB):                         0.00
Kurtosis:                       2.991   Cond. No.                         4.44
==============================================================================
means for DRSTS by cluster
            DRSTS
cluster
0        0.767152
1        0.831582
2        0.807490
3        0.735330

standard deviations for DRSTS by cluster
            DRSTS
cluster
0        0.422678
1        0.374264
2        0.394332
3        0.441211
Multiple Comparison of Means - Tukey HSD,FWER=0.05
=============================================
group1 group2 meandiff  lower   upper  reject
---------------------------------------------
  0      1     0.0644   0.0464  0.0824  True
  0      2     0.0403   0.0179  0.0628  True
  0      3    -0.0318  -0.0525 -0.0111  True
  1      2    -0.0241  -0.0464 -0.0018  True
  1      3    -0.0963  -0.1168 -0.0757  True
  2      3    -0.0722  -0.0967 -0.0476  True
---------------------------------------------
maniparas-blog · 8 years ago
Machine Learning For Data Analysis  Week 3 : Lasso Regression
Code :
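import pandas
import numpy
import matplotlib.pylab as plt
# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split lives in model_selection
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoLarsCV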
data = pandas.read_csv('nesarc_pds.csv', low_memory=False)
# Set variables to numeric
data['CONSUMER'] = pandas.to_numeric(data['CONSUMER'], errors='coerce')
data['S2AQ16A'] = pandas.to_numeric(data['S2AQ16A'], errors='coerce')
data['S2DQ3C1'] = pandas.to_numeric(data['S2DQ3C1'], errors='coerce')
data['S2DQ3C2'] = pandas.to_numeric(data['S2DQ3C2'], errors='coerce')
data['S2DQ4C1'] = pandas.to_numeric(data['S2DQ4C1'], errors='coerce')
data['S2DQ4C2'] = pandas.to_numeric(data['S2DQ4C2'], errors='coerce')
data['S2DQ1'] = pandas.to_numeric(data['S2DQ1'], errors='coerce')
data['S2DQ2'] = pandas.to_numeric(data['S2DQ2'], errors='coerce')
data['SEX'] = pandas.to_numeric(data['SEX'], errors='coerce')

# Subset data to respondents who started drinking between ages 10 and 30
sub1 = data[((data['S2AQ16A'] >= 10) & (data['S2AQ16A'] <= 30))]
# Copy to a new DataFrame
sub2 = sub1.copy()

# Recode missing-data codes (9 and 99) to NaN
sub2['S2AQ16A'] = sub2['S2AQ16A'].replace(99, numpy.nan)
sub2['S2DQ3C1'] = sub2['S2DQ3C1'].replace(99, numpy.nan)
sub2['S2DQ3C2'] = sub2['S2DQ3C2'].replace(9, numpy.nan)
sub2['S2DQ4C1'] = sub2['S2DQ4C1'].replace(99, numpy.nan)
sub2['S2DQ4C2'] = sub2['S2DQ4C2'].replace(9, numpy.nan)
sub2['S2DQ1'] = sub2['S2DQ1'].replace(9, numpy.nan)
sub2['S2DQ2'] = sub2['S2DQ2'].replace(9, numpy.nan)

# Secondary variable: number of siblings who drink (brothers + sisters)
sub2['SIBNO'] = sub2['S2DQ3C1'] + sub2['S2DQ4C1']

# Sibling drinking status, combining data for brothers and sisters
def SIBSTS(row):
    if any([row['S2DQ3C2'] == 1, row['S2DQ4C2'] == 1]):
        return 1
    elif all([row['S2DQ3C2'] == 2, row['S2DQ4C2'] == 2]):
        return 0
    else:
        return numpy.nan

sub2['SIBSTS'] = sub2.apply(lambda row: SIBSTS(row), axis=1)

# Parent drinking status, combining data for father and mother
def PRSTS(row):
    if any([row['S2DQ1'] == 1, row['S2DQ2'] == 1]):
        return 1
    elif all([row['S2DQ1'] == 2, row['S2DQ2'] == 2]):
        return 0
    else:
        return numpy.nan

sub2['PRSTS'] = sub2.apply(lambda row: PRSTS(row), axis=1)

# Recode 'CONSUMER' into a new variable, DRSTS
recode1 = {1: 1, 2: 0, 3: 0}
sub2['DRSTS'] = sub2['CONSUMER'].map(recode1)

# Recode SEX into a new variable, GEN
recode2 = {1: 1, 2: 0}
sub2['GEN'] = sub2['SEX'].map(recode2)
data_clean = sub2.dropna()

"""
Modeling and Prediction
"""
# Select predictors and target, standardize, then split into training and testing sets
pred = data_clean[['SIBNO','SIBSTS','PRSTS','GEN']]
target = data_clean['S2AQ16A']

predictors = pred.copy()
from sklearn import preprocessing

# standardize predictors to mean 0 and standard deviation 1
predictors['SIBNO'] = preprocessing.scale(predictors['SIBNO'].astype('float64'))
predictors['SIBSTS'] = preprocessing.scale(predictors['SIBSTS'].astype('float64'))
predictors['PRSTS'] = preprocessing.scale(predictors['PRSTS'].astype('float64'))
predictors['GEN'] = preprocessing.scale(predictors['GEN'].astype('float64'))

pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target,
                                                              test_size=.3, random_state=123)
# specify the lasso regression model
model = LassoLarsCV(cv=10, precompute=False).fit(pred_train, tar_train)

# print variable names and regression coefficients
dict(zip(predictors.columns, model.coef_))

# plot coefficient progression
m_log_alphas = -numpy.log10(model.alphas_)
ax = plt.gca()
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-numpy.log10(model.alpha_), linestyle='--', color='k',
            label='alpha CV')
plt.ylabel('Regression Coefficients')
plt.xlabel('-log(alpha)')
plt.title('Regression Coefficients Progression for Lasso Paths')

# plot mean square error for each fold
# (the attribute was called cv_mse_path_ in old scikit-learn releases; it is now mse_path_)
m_log_alphascv = -numpy.log10(model.cv_alphas_)
plt.figure()
plt.plot(m_log_alphascv, model.mse_path_, ':')
plt.plot(m_log_alphascv, model.mse_path_.mean(axis=-1), 'k',
         label='Average across the folds', linewidth=2)
plt.axvline(-numpy.log10(model.alpha_), linestyle='--', color='k',
            label='alpha CV')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean squared error')
plt.title('Mean squared error on each fold')
# MSE from training and test data
from sklearn.metrics import mean_squared_error
train_error = mean_squared_error(tar_train, model.predict(pred_train))
test_error = mean_squared_error(tar_test, model.predict(pred_test))
print('training data MSE')
print(train_error)
print('test data MSE')
print(test_error)

# R-square from training and test data
rsquared_train = model.score(pred_train, tar_train)
rsquared_test = model.score(pred_test, tar_test)
print('training data R-square')
print(rsquared_train)
print('test data R-square')
print(rsquared_test)
OUTPUT :
Regression coefficient : 
GEN : -0.64438534614104892
PRSTS : -0.38464097596963004
SIBNO : -0.057518355093678704
SIBSTS : -0.10840676158247332

training data MSE : 11.1956810873
test data MSE : 11.5041705471
training data R-square : 0.0485507314614
test data R-square : 0.0454230302836
Tumblr media Tumblr media
Conclusion :
A lasso regression analysis was conducted to identify the best subset of predictors out of a total of four. The target variable is quantitative and describes age at onset of alcohol consumption. The predictors are three categorical variables, SIBSTS (siblings' drinking status), PRSTS (parents' drinking status) and GEN (gender), and one quantitative variable, SIBNO (number of siblings who drink).
All predictor variables were standardized to have a mean of zero and a standard deviation of 1.
The data were split into a training set (70%) and a test set (30%). The least angle regression algorithm with k=10 fold cross validation was used to estimate the lasso regression model in the training set, and the model was validated using the test set. The change in the cross validation average (mean) squared error at each step was used to identify the best subset of predictor variables.
After running the lasso regression, all four predictors have negative coefficients and none was shrunk to zero, so all of them are retained. GEN has the largest coefficient in magnitude (-0.64) and SIBNO the smallest (-0.06).
The model is weak: it explains less than 5% of the variability in age at onset (R-square of 0.049 in the training data and 0.045 in the test data).
maniparas-blog · 8 years ago
Machine Learning For Data Analysis  Week 2 : Random Forest
Code :
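from sklearn.model_selection import train_test_split

# data_clean is the cleaned NESARC subset (rows with missing values dropped)
predictors = data_clean[['S2AQ16A','SIBNO','SIBSTS','PRSTS','GEN']]
targets = data_clean['DRSTS'].astype('category')

# 60% training data, 40% test data
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)

print(pred_train.shape)
print(pred_test.shape)
print(tar_train.shape)
print(tar_test.shape)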
Output :
(17902, 5)
(11935, 5)
(17902,)
(11935,)
Code :
# Build model on training data
import sklearn.metrics
from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(n_estimators=25)
classifier = classifier.fit(pred_train, tar_train)

predictions = classifier.predict(pred_test)

print(sklearn.metrics.confusion_matrix(tar_test, predictions))
print(sklearn.metrics.accuracy_score(tar_test, predictions))
Output :
[[  36 2489]
 [  95 9315]]
0.783493925429
Code :
# fit an Extra Trees model to the data
from sklearn.ensemble import ExtraTreesClassifier
model = ExtraTreesClassifier()
model.fit(pred_train, tar_train)
# display the relative importance of each attribute
print(model.feature_importances_)
Output :
[ 0.54571632  0.24291864  0.05134129  0.05579401  0.10422973]
Code :
"""
Running a different number of trees and see the effect
of that on the accuracy of the prediction
"""
trees = range(25)
accuracy = numpy.zeros(25)

for idx in range(len(trees)):
    classifier = RandomForestClassifier(n_estimators=idx + 1)
    classifier = classifier.fit(pred_train, tar_train)
    predictions = classifier.predict(pred_test)
    accuracy[idx] = sklearn.metrics.accuracy_score(tar_test, predictions)

plt.cla()
plt.plot(trees, accuracy)
Output :
Tumblr media
Conclusion :
The aim is to find the important explanatory variables for predicting the binary response variable 'DRSTS' (drinking status).
The explanatory variables with the highest relative importance scores were 'S2AQ16A' (age at onset of alcohol consumption), 'SIBNO' (number of siblings who drink) and 'GEN' (gender). The accuracy of the random forest was 78%. Growing multiple trees rather than a single tree added little to the overall accuracy of the model, which suggests that interpreting a single decision tree may be appropriate.
maniparas-blog · 8 years ago
Machine Learning For Data Analysis  Week 1 : Decision Tree
My target variable 'DRSTS' is a categorical variable that describes drinking status.
The predictor variables are:
SIBSTS : Sibling drinking status
PRSTS : Parents drinking status
GEN : Gender
SIBNO : No. of siblings who drink
S2AQ16A : Age at onset of drinking
After running the decision tree classifier I got an accuracy of 1.0 and only one element in the confusion matrix. The decision tree image also shows only one node.
Please help me find where I am making a mistake. The code and output are below.
code :
Tumblr media Tumblr media Tumblr media
Output :
Shape of training and test data :
(17902, 5)
(11935, 5)
(17902L,)
(11935L,)
Confusion matrix and classification accuracy :
[[11935]]
1.0
Decision tree graph :
Tumblr media
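The code for this post survives only as images, so the actual cause cannot be confirmed here. One thing worth checking, though, is the class distribution of the target. A 1x1 confusion matrix means that only one class appears in both the true and the predicted labels, and a tree trained on a constant target has nothing to split on. That could happen, for example, if the subset on age at onset keeps only people who have ever drunk and all of them are coded 1. The sketch below uses hypothetical all-ones arrays in place of the real `tar_test` and `predictions` to show the symptom and the check.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical stand-ins: every retained row has the same target class
tar_test = np.ones(10, dtype=int)
predictions = np.ones(10, dtype=int)

# Check how many distinct classes the target actually contains
classes, counts = np.unique(tar_test, return_counts=True)
print(dict(zip(classes.tolist(), counts.tolist())))

# With a single class the confusion matrix collapses to one cell
cm = confusion_matrix(tar_test, predictions)
print(cm)

# Accuracy is trivially perfect in that situation
accuracy = float((tar_test == predictions).mean())
print(accuracy)
```

On a pandas target the same check would be `value_counts()`; if it lists a single class, the recoding or subsetting step is the place to look rather than the classifier.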
maniparas-blog · 8 years ago
Regression Modelling in Practice Week 4 - Logistic Regression : Test a Logistic Regression Model
Output : 
                           Logit Regression Results
==============================================================================
Dep. Variable:                  DRSTS   No. Observations:                38545
Model:                          Logit   Df Residuals:                    38541
Method:                           MLE   Df Model:                            3
Date:                Wed, 21 Dec 2016   Pseudo R-squ.:                 0.04578
Time:                        00:29:28   Log-Likelihood:                -18260.
converged:                       True   LL-Null:                       -19136.
                                        LLR p-value:                     0.000
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept      0.8667      0.017     49.531      0.000         0.832     0.901
SIBSTS         0.2450      0.037      6.546      0.000         0.172     0.318
PRSTS          0.7872      0.038     20.948      0.000         0.714     0.861
GEN            0.9478      0.029     33.252      0.000         0.892     1.004
==============================================================================
           Lower CI  Upper CI        OR
Intercept  2.298729  2.461926  2.378928
SIBSTS     1.187283  1.374926  1.277664
PRSTS      2.041221  2.365176  2.197236
GEN        2.439810  2.728227  2.579991
SIBSTS : sibling drinking status
DRSTS : drinking(alcohol) status
PRSTS : parent drinking status
GEN : gender
The aim is to examine the association between alcohol drinking status and siblings' drinking status.
Summary :
The hypothesis is supported: SIBSTS is significantly and positively associated with DRSTS after controlling for the other two variables (p-value below 0.05).
No confounders were found.
The odds of alcohol consumption are 1.28 times as high for people whose siblings drink as for people whose siblings do not drink (OR = 1.28, 95% CI = 1.19 - 1.37, p < .001).
The odds of alcohol consumption are 2.2 times as high for people whose parents drink as for people whose parents do not drink (OR = 2.20, 95% CI = 2.04 - 2.37, p < .001).
The odds of alcohol consumption are 2.58 times as high for males as for females (OR = 2.58, 95% CI = 2.44 - 2.73, p < .001).
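The odds ratios and their confidence limits are simply the exponentials of the logit coefficients and of their confidence bounds. A short sketch, using the rounded coefficients from the regression table above (the dictionary layout is just for illustration):

```python
import math

# (coefficient, lower 95% bound, upper 95% bound) copied from the Logit table
logit = {
    "SIBSTS": (0.2450, 0.172, 0.318),
    "PRSTS": (0.7872, 0.714, 0.861),
    "GEN": (0.9478, 0.892, 1.004),
}

# Exponentiate each value to move from log-odds to odds ratios
odds_ratios = {
    name: tuple(round(math.exp(value), 2) for value in values)
    for name, values in logit.items()
}

for name, (odds_ratio, lower, upper) in odds_ratios.items():
    print(f"{name}: OR = {odds_ratio:.2f}, 95% CI = {lower:.2f} - {upper:.2f}")
```

Because the table coefficients are rounded, the results can differ from the reported odds ratios in the third decimal place.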
maniparas-blog · 8 years ago
Regression Modelling in Practice Week 3 - Multiple Regression : Test a Multiple Regression Model
My aim is to find the association of siblings' drinking status and the number of siblings who drink with a person's age at onset of drinking.
Explanatory variable : 
SIBNO - Number of siblings who drink
SIBSTS - drinking status of siblings
PRSTS- drinking status of parent
GEN - Person’s gender
GEN and PRSTS are treated as possible confounders.
Discussion of the results for the associations between all of the explanatory variables and the response variable :
Tumblr media
Summary :
All explanatory variables have negative coefficients, so each has a negative linear relationship with the response variable.
Of the explanatory variables, only SIBNO has a non-significant p-value, although it was significant in the simple linear regression.
So its earlier significance was due to confounding.
Regression Diagnostic Plots :
Q-Q Plot:
Tumblr media
The residuals depart from the line at the lower and upper quantiles, so the model does not fit well in the tails.
Residuals Plot :
Tumblr media
Many observations have residuals greater than 3 or less than -3, so there are many extreme outliers.
Regression Plots :
Tumblr media
All explanatory variables other than SIBNO are categorical with only two levels, so this plot is drawn only for SIBNO.
The residuals vs SIBNO plot shows that the residuals are larger at low values of SIBNO and again at high values.
The partial regression plot shows a very weak negative linear relationship between age at onset of drinking (S2AQ16A) and SIBNO.
Leverage Plot : 
Tumblr media
The leverage plot shows that there are many outliers.
maniparas-blog · 9 years ago
Regression Modelling in Practice Week 2 - Basics of Linear Regression : Test a Basic Linear Regression Model
Code :
Tumblr media Tumblr media
Output :
Tumblr media
Conclusion :
I took siblings' drinking status (SIBSTS) as my categorical explanatory variable with two levels and age at onset of alcohol drinking (S2AQ16A) as my quantitative response variable.
The average age at onset of alcohol drinking when SIBSTS is 'Yes' is slightly lower than when SIBSTS is 'No'.
The p-value is very low and thus significant, so we can reject the null hypothesis. The intercept is 19.0537 and the slope is -0.5009. There is a negative linear association between these two variables, but it is not strong, as the slope is small in magnitude.
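With a two-level explanatory variable coded 0/1, the intercept is the predicted mean for the 'No' group and intercept plus slope is the predicted mean for the 'Yes' group. A tiny sketch with the reported estimates (the function name is made up for illustration):

```python
# Estimates reported in the regression output
intercept = 19.0537
slope = -0.5009

def predicted_onset_age(sibsts):
    """Predicted mean age at onset for SIBSTS = 0 (No) or 1 (Yes)."""
    return intercept + slope * sibsts

mean_no = predicted_onset_age(0)
mean_yes = predicted_onset_age(1)

print(f"SIBSTS = No : {mean_no:.4f} years")
print(f"SIBSTS = Yes: {mean_yes:.4f} years")
print(f"Difference  : {mean_yes - mean_no:.4f} years")
```

So the model predicts that people whose siblings drink start about half a year earlier on average.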
maniparas-blog · 9 years ago
Regression Modelling in Practice Week 1 - Introduction to Regression  Assignment : Writing about your data
Sample:  
The sample is from the first wave of the National Epidemiologic Survey on Alcohol and Related Conditions (NESARC), the largest nationwide longitudinal survey of alcohol and drug use and associated psychiatric and medical comorbidities. Participants (N=43,093) represented the civilian, non-institutionalized adult population of the United States, and included persons living in households, military personnel living off base, and persons residing in the following group quarters: boarding or rooming houses, non-transient hotels and motels, shelters, facilities for housing workers, college quarters, and group homes. The NESARC included oversampling of Blacks, Hispanics and young adults aged 18 to 24 years. The data analytic sample for this study included participants whose age at onset of alcohol consumption was between 10 and 30 years (N=41,281), as 95% fall in this range.
 Procedure:
Data were collected by trained U.S. Census Bureau Field Representatives during 2001–2002 through computer-assisted personal interviews (CAPI). One adult was selected for interview in each household, and interviews were conducted in respondents’ homes following informed consent procedures.
 Measures:  
Response variable :
Drinking status is taken from Section 2A: Alcohol Consumption of the NESARC wave 1 dataset. Current drinkers and ex-drinkers are combined and coded as '1', and lifetime abstainers are coded as '0'. The subset with age at onset of alcohol consumption between 10 and 30 years is used, as more than 95% are in this range.
Explanatory Variable :
From Section 2D: Family History of Alcoholism, brothers' and sisters' drinking statuses are combined into a single siblings' drinking status, and the numbers of brothers and sisters who drink are combined into the number of siblings who drink.
I want to study the relationship between siblings' drinking status and alcohol consumption.
maniparas-blog · 9 years ago
Data Analysis Tools  WEEK 4 : Exploring Statistical Interactions
The moderating variable is gender.
ANOVA :
Tumblr media
OUTPUT :
Tumblr media
Taking the moderating variable into account:
Tumblr media
OUTPUT:
Tumblr media Tumblr media
OUTPUT :
Tumblr media
CONCLUSION :
Overall the p-value is significant. When gender is taken as the moderating variable, for males the mean for the 'no' case increases slightly and the p-value becomes non-significant, while for females the mean for the 'no' case decreases slightly and the p-value decreases and becomes even more significant. So we can conclude that for males there is no association between SIBSTS (siblings' drinking status) and S2AQ16A (age at onset of drinking).
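The code for this post exists only as screenshots, so the following is just a sketch of the general idea: run the same one-way ANOVA separately within each level of the moderator. The data are synthetic, and `scipy.stats.f_oneway` is used here as a stand-in for whichever ANOVA routine the original code called.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)
n = 400

# Synthetic stand-ins for gender, siblings' drinking status and age at onset
df = pd.DataFrame({
    "GEN": rng.integers(0, 2, n),
    "SIBSTS": rng.integers(0, 2, n),
})
df["S2AQ16A"] = 19 - 0.5 * df["SIBSTS"] + rng.normal(0, 3, n)

# Run the ANOVA of age at onset by SIBSTS within each gender
results = {}
for gen, group in df.groupby("GEN"):
    onset_no = group.loc[group["SIBSTS"] == 0, "S2AQ16A"]
    onset_yes = group.loc[group["SIBSTS"] == 1, "S2AQ16A"]
    f_stat, p_val = stats.f_oneway(onset_no, onset_yes)
    results[int(gen)] = (f_stat, p_val)
    print(f"GEN = {gen}: F = {f_stat:.3f}, p = {p_val:.4f}")
```

If the p-values differ markedly between the two subgroups, as reported above, gender moderates the association.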
CHI SQUARE :
Tumblr media
OUTPUT :
Tumblr media
Taking the moderating variable into account:
Tumblr media
OUTPUT :
Tumblr media Tumblr media
OUTPUT :
Tumblr media
CONCLUSION :
In all three cases the p-value is significant, but among males the chance of being a drinker when siblings are drinkers increases, while among females it decreases.
PEARSON CORRELATION :
Tumblr media
OUTPUT :
Tumblr media
Taking the moderating variable into account:
Tumblr media
OUTPUT :
Tumblr media Tumblr media
OUTPUT :
Tumblr media
CONCLUSION :
All three show a linear relation, but it is negative for males, positive for females, and negative overall. So the direction of the relation changes for females. However, the relation is significant only for males.
maniparas-blog · 9 years ago
Data Analysis Tools  WEEK 3 : Pearson Correlation Coefficient
Code :
Tumblr media Tumblr media
Output :
Tumblr media
Conclusion :
The correlation coefficient is negative and very small in magnitude. So there is a very weak negative linear relationship: age at onset of drinking decreases slightly as the number of siblings who drink increases. The p-value is very small, which suggests that this linear relationship is unlikely to be due to sampling variability alone.
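To show what the coefficient measures, here is a minimal sketch that computes Pearson's r by hand in plain Python. The two lists are toy values invented for illustration (not NESARC data), and the names `sib_count` and `age_onset` are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares around the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


# Toy data, invented for illustration only.
sib_count = [0, 0, 1, 1, 2, 3]        # number of siblings who drink
age_onset = [19, 18, 18, 17, 18, 16]  # age when started drinking

r = pearson_r(sib_count, age_onset)
print("r =", r)
print("r squared =", r ** 2)  # share of variability explained by the linear relation
```

The sign of r gives the direction of the relation and r squared gives the fraction of variability in one variable that the other explains. The p-value needs the t distribution, which this sketch leaves out.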
maniparas-blog · 9 years ago
Data Analysis Tools  WEEK 2 : Chi-Square Test of Independence
Code :
Tumblr media Tumblr media Tumblr media Tumblr media
Output :
Tumblr media
Observation & Discussion :
First I ran a chi-square test with a categorical explanatory variable with only two categories, ‘SIBSTS’ (siblings’ drinking status, 0: no, 1: yes), and the response variable ‘DRSTS’ (drinking status, 0: no, 1: yes). I found p-value << 0.05. So we can reject the null hypothesis and conclude that drinking status is associated with siblings’ drinking status.
Next I ran a chi-square test with a categorical explanatory variable with three categories, ‘SIBNO’ (number of siblings who drink), binned as (0,4), (4,8) and (8,12). I got p-value > 0.05, so we cannot reject the null hypothesis: there is no evidence that the number of siblings who drink is associated with drinking status. We don’t need a post hoc test, because a post hoc test is only done after a significant p-value, to check the pairwise associations.
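The two-category test can be worked through by hand. Here is a minimal sketch in plain Python that computes the Pearson chi-square statistic and its p-value for a 2x2 table. The counts are toy values invented for illustration (not the NESARC cross-tabulation), and no continuity correction is applied, so library routines that apply one by default may give a slightly different result.

```python
from math import erfc, sqrt


def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value for a 2x2 table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    # With 1 degree of freedom the upper tail has the closed form
    # erfc(sqrt(chi2 / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Rows: siblings do not drink / siblings drink.
# Columns: respondent does not drink / respondent drinks.
# Toy counts, invented for illustration only.
table = [[30, 70], [10, 90]]
chi2, p_value = chi_square_2x2(table)
print("chi-square =", chi2)
print("p-value =", p_value)
```

The closed-form p-value only holds for one degree of freedom. The three-category test has two degrees of freedom and needs the general chi-square distribution.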
maniparas-blog · 9 years ago
Data Analysis Tools  Week 1 : Hypothesis Testing and ANOVA - Assignment
Code :
Tumblr media Tumblr media Tumblr media Tumblr media
Output :
Tumblr media Tumblr media Tumblr media
Conclusion and Discussion :
First I computed the p-value for the variables ‘S2AQ16A’ (age when started drinking) and ‘SIBSTS’ (siblings’ drinking status). I found a very low p-value (p-value << 5%), so I reject the null hypothesis. Respondents whose siblings drink were found to have a slightly lower mean age when they started drinking.
In the second case I took a categorical explanatory variable with more than two categories, and again found a very low p-value (p-value << 5%). So I can reject the null hypothesis and conclude that ethnicity is associated with the age when drinking started. From the Tukey HSD post hoc test I found that groups 2 and 4 are the only pair whose mean ages do not differ significantly.
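The F statistic behind these p-values compares the variability between group means with the variability within groups. Here is a minimal sketch in plain Python. The three groups hold toy ages invented for illustration (not NESARC data), and the function name `one_way_anova_f` is hypothetical.

```python
def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA over a list of groups of values."""
    all_values = [value for group in groups for value in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]

    # Variability of the group means around the grand mean.
    ss_between = sum(
        len(group) * (m - grand_mean) ** 2
        for group, m in zip(groups, group_means)
    )
    # Variability of the observations around their own group mean.
    ss_within = sum(
        (value - m) ** 2
        for group, m in zip(groups, group_means)
        for value in group
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Toy ages at onset for three groups, invented for illustration only.
groups = [[17, 18, 19], [19, 20, 21], [17, 18, 19]]
f_stat = one_way_anova_f(groups)
print("F =", f_stat)
```

A large F means the group means differ by more than the spread within groups would suggest. Turning F into a p-value requires the F distribution, which the standard library does not provide.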
maniparas-blog · 9 years ago
Data Management and Visualisation Week - 4 Assignment : Data Management
Code :
Code only for plotting graphs
Tumblr media Tumblr media
Output  And Summary:
First uni-variate analysis 
Tumblr media
The number of current drinkers is much larger than that of the other groups.
Tumblr media
The frequency of people whose siblings drink is only approximately 7,500, while the frequency of people whose siblings do not drink is approximately 33,000.
Tumblr media
The value ‘9’ corresponds to people who do not drink. Leaving ‘9’ aside, the graph is unimodal (age 18 has the highest frequency, approximately 7,100) and tails off on both sides.
Tumblr media
The value 0 has the highest frequency. We can conclude that for the majority of people no siblings drink.
Bi-variate Analysis :
Tumblr media
People whose siblings drink have a higher proportion of alcohol consumption. This shows that a person whose siblings drink has a greater tendency to be a drinker, which is consistent with my hypothesis.
Tumblr media
This graph shows that as the number of siblings who drink increases, the proportion of alcohol consumption also increases, i.e. a person’s chance of being a drinker increases.
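The bar heights in these bivariate graphs are the proportion of drinkers within each category of the explanatory variable. Here is a minimal sketch of that calculation in plain Python. The rows are toy values invented for illustration (not NESARC data), and the keys `sibsts` and `drsts` are hypothetical lowercase stand-ins for the coded variables.

```python
from collections import defaultdict


def proportion_by_group(rows, group_key, outcome_key):
    """Share of rows with outcome 1 within each level of the grouping variable."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for row in rows:
        totals[row[group_key]] += 1
        positives[row[group_key]] += row[outcome_key]
    return {level: positives[level] / totals[level] for level in sorted(totals)}


# Toy rows, invented for illustration only.
# sibsts: 1 if siblings drink, drsts: 1 if the respondent drinks.
rows = [
    {"sibsts": 0, "drsts": 1},
    {"sibsts": 0, "drsts": 0},
    {"sibsts": 0, "drsts": 1},
    {"sibsts": 0, "drsts": 0},
    {"sibsts": 1, "drsts": 1},
    {"sibsts": 1, "drsts": 1},
    {"sibsts": 1, "drsts": 1},
    {"sibsts": 1, "drsts": 0},
]

proportions = proportion_by_group(rows, "sibsts", "drsts")
print(proportions)
```

Because the response is coded 0/1, the proportion of drinkers in a group is simply the mean of the response in that group.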
Tumblr media
I could not draw any conclusion from this scatter plot.
maniparas-blog · 9 years ago
Data Management and Visualisation Week -3 Assignment : Data Management
Code :-
Tumblr media Tumblr media Tumblr media
Output :-
Tumblr media Tumblr media
Discussion :-
Subset the Data :
During the second assignment I found that more than 95% of the values of variable S2AQ16A (AGE WHEN STARTED DRINKING, NOT COUNTING SMALL TASTES OR SIPS) fall between ages 10 and 30. So I decided to subset my data set to ages 10-30 for this variable.
Recoding Missing Data :
Done for these variables :
S2AQ16A    AGE WHEN STARTED DRINKING, NOT COUNTING SMALL TASTES OR SIPS
   99.  Unknown  -  set ‘99’ to ‘nan’
   BL.  NA, lifetime abstainer  -  coded as valid data
S2DQ3C1    NUMBER OF FULL BROTHERS WHO WERE EVER ALCOHOLICS OR PROBLEM DRINKERS
   99.  Unknown  -  set ‘99’ to ‘nan’
S2DQ3C2    ANY FULL BROTHERS EVER ALCOHOLICS OR PROBLEM DRINKERS
   9.  Unknown  -  set ‘9’ to ‘nan’
S2DQ4C1    NUMBER OF FULL SISTERS WHO WERE EVER ALCOHOLICS OR PROBLEM DRINKERS
   99.  Unknown  -  set ‘99’ to ‘nan’
S2DQ4C2    ANY FULL SISTERS EVER ALCOHOLICS OR PROBLEM DRINKERS
   9.  Unknown  -  set ‘9’ to ‘nan’
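The recoding listed above can be sketched in plain Python as follows. This is a minimal illustration, not the code from the screenshots: the mapping `MISSING_CODES` and the function `recode_missing` are hypothetical names, and the example row holds invented values.

```python
from math import isnan, nan

# 'Unknown' code for each variable, as listed in the codebook excerpt.
MISSING_CODES = {
    "S2AQ16A": 99,
    "S2DQ3C1": 99,
    "S2DQ3C2": 9,
    "S2DQ4C1": 99,
    "S2DQ4C2": 9,
}


def recode_missing(row):
    """Return a copy of the row with each 'Unknown' code replaced by NaN."""
    return {
        name: (nan if MISSING_CODES.get(name) == value else value)
        for name, value in row.items()
    }


# One invented respondent, for illustration only.
row = {"S2AQ16A": 99, "S2DQ3C1": 2, "S2DQ3C2": 9, "S2DQ4C1": 0, "S2DQ4C2": 1}
cleaned = recode_missing(row)
print(cleaned)
```

Only the listed ‘Unknown’ codes are replaced. Every other value, including a valid count of 0, passes through unchanged.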
Coding Valid Data:
CONSUMER       DRINKING STATUS
26946 1. Current drinker
7881  2. Ex-drinker
8266   3. Lifetime Abstainer
S2AQ16A        AGE WHEN STARTED DRINKING, NOT COUNTING SMALL TASTES OR SIPS
33891  5-83.   Age
936    99.    Unknown
8266   BL.    NA, lifetime abstainer
In variable S2AQ16A, ‘BL’ is valid data, as it corresponds to option ‘3’ (lifetime abstainer) in the ‘CONSUMER’ variable.
Creating Secondary Variable:
As my question is focused on siblings, I added the number of brothers (S2DQ3C1 - NUMBER OF FULL BROTHERS WHO WERE EVER ALCOHOLICS OR PROBLEM DRINKERS) and the number of sisters (S2DQ4C1 - NUMBER OF FULL SISTERS WHO WERE EVER ALCOHOLICS OR PROBLEM DRINKERS) to get the number of siblings who drink, and recorded it in a new variable, SIBNO.
As I am more concerned with siblings’ drinking status, I defined another variable for it using the variables S2DQ3C2 (ANY FULL BROTHERS EVER ALCOHOLICS OR PROBLEM DRINKERS) and S2DQ4C2 (ANY FULL SISTERS EVER ALCOHOLICS OR PROBLEM DRINKERS).
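The two secondary variables can be sketched in plain Python like this. It is a minimal illustration under stated assumptions: the function names are hypothetical, and the sketch assumes that ‘Yes’ is coded as 1 in S2DQ3C2 and S2DQ4C2, which should be checked against the codebook.

```python
def sibling_number(brothers, sisters):
    """SIBNO: brothers plus sisters who were ever problem drinkers."""
    return brothers + sisters


def sibling_status(any_brother, any_sister, yes_code=1):
    """SIBSTS: 1 if any brother or any sister is coded 'Yes', else 0.

    Assumes 'Yes' is coded as yes_code (1 here); verify against the codebook.
    """
    return 1 if yes_code in (any_brother, any_sister) else 0


# Invented example values, for illustration only.
print(sibling_number(1, 2))   # one brother and two sisters
print(sibling_status(2, 1))   # only the sister variable is coded 'Yes'
print(sibling_status(2, 2))   # neither variable is coded 'Yes'
```

If either count is missing (NaN), the sum is NaN as well, so missing values carry through to SIBNO instead of being counted as zero.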
maniparas-blog · 9 years ago
Data Management and Visualisation Week -2 Assignment : Running Your First Program
Code :
Tumblr media Tumblr media
Output :
Tumblr media Tumblr media Tumblr media Tumblr media Tumblr media
Summary :
Drinking status :-
Current Drinker : 62.53%
Ex Drinker : 18.29 %
Age When Started Drinking :-
Ages 14-22 account for 63.01%
Age 18 has the highest percentage (16.34%), followed by age 21 (10.37%)
Number Of Full Brothers Who Were Ever Alcoholics Or Problem Drinkers  :-
0 : 82.21%
1 : 10.23%
Any Full Brothers Ever Alcoholics Or Problem Drinkers :- 
Yes : 14.41%
No : 83.0%
Number Of Full Sisters Who Were Ever Alcoholics Or Problem Drinkers  :-
0 : 90.71%
1 : 4.28%
Any Full Sisters Ever Alcoholics Or Problem Drinkers :-
Yes : 6.28%
No : 91.38%
There were missing data.
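The percentages in this summary are frequency counts divided by the sample size. Here is a minimal sketch in plain Python that uses the CONSUMER counts from the NESARC codebook (26946 current drinkers, 7881 ex-drinkers, 8266 lifetime abstainers). The variable names are illustrative.

```python
# Frequency counts for the CONSUMER (drinking status) variable.
counts = {
    "Current drinker": 26946,
    "Ex-drinker": 7881,
    "Lifetime abstainer": 8266,
}

total = sum(counts.values())  # 43093 respondents
percentages = {
    label: round(100 * n / total, 2) for label, n in counts.items()
}

for label, pct in percentages.items():
    print(f"{label}: {pct}%")
```

This reproduces the 62.53% and 18.29% reported for current and ex-drinkers, and gives 19.18% for lifetime abstainers.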