dmitrybachindata-blog
dmitrybachindata-blog
Assignments in Data Science Course
18 posts
Don't wanna be here? Send us removal request.
dmitrybachindata-blog · 7 years ago
Text
k-means cluster analysis
A k-means analysis was carried out to identify subgroups of adults (age is higher than 18) based on 10 variables which are listed in the table below to evaluate the influence of these characteristics on smoking behavior. As validation variable, I have used tobacco use status with 1 for current smokers, 0.5 -  ex-smokers and 0 for lifetime non-smokers.  All clustering variables were standardized to have a mean of 0 and a standard deviation of 1.
Tumblr media
The whole data were divided into a training and test sets by random. The training set included 70% participants (N=30165) and the test set contained 30% of participants (N=12928). A series of k-means cluster analyses were conducted on the training data specifying k=1-9 clusters, using Euclidean distance. 
As a result, I have received the Elbow curve with pretty obvious bends on position 2,3 and 6. Cause the band on position six demonstrated much less decreasing, I have decided to 
Tumblr media
In order to reduce 10 clustering variables the canonical discriminant analysis was used. As result, the scatterplot was built. We can see three distinct groups on this plot. The top one is contain only one kind of variable. The bottom one can be visually separated, and the miidle one is difficult to separate.
Tumblr media
Cause of this pretty confused results. I have changed number of cluster to 3(next bend on the elbow plot.)
Tumblr media
The picture is clearer now. We have three separate group. The bottom one contains two clusters which easily can be separated. The top one cluster. The middle on is a mix of all clusters.
The cluster 0 tends to be the youngest, wealthy and educated, not covered by Medicare or Medicaid, has a higher probability to be alcohol consumption and tend more than other to have lifetime major depression and not to have an experience fear or avoidance of social situation. 
The cluster 1 is a group with higher probability to be elder, to have Medicare or Medicaid, and to be women. Also, the participants from this cluster tend to be less educated and wealthy than any other cluster. Additionally. this cluster tends not to have a 2-week period of depression and not to be alcohol consumption.
The last cluster (2) have the first place in “had fear or avoidance of social situation” and “ ever had a 2-week period of depression, sadness...” and holds the last places considering major depression lifetime.
The result of comparison says that there is difference between all clusters (p-value is close to 0). ANd posthoc test shows that null hypothesis is reject in every case. The results shows people from the cluster 2 tend to be smokers with higher probability than 0 and 1. The cluster 1 has the least probability considering tobacco use status.
However, comparing with other analysis in this course. In this analysis I have received the least R-square= 0.002, so this model can explain only 0.2% of variance in tobacco use status. 
Output(without pictures, they are above):
train and test 30165.1 12927.9
 Clustering variable means by cluster                index      S7Q1       AGE    S1Q12A     S4AQ1       SEX  \ cluster                                                                   0        21588.360434 -0.098457 -0.377225  0.095464 -0.134877  0.024886   1        20760.229365 -0.061749  1.328642 -0.333813 -0.143761 -0.117437   2        22743.101286  4.421346  0.026240 -0.015789  6.456610  0.041288  
         S1Q14C1   S1Q14C2     S1Q6A  CONSUMER  MAJORDEPLIFE   cluster                                                         0       -0.514203 -0.059903  0.121411  0.124950      0.036088   1        1.830298  0.244347 -0.432456 -0.411861     -0.088959   2        0.056431 -0.002250 -0.124050 -0.355958     -0.467381                              OLS Regression Results                             ============================================================================== Dep. Variable:                 SMOKER   R-squared:                       0.002 Model:                            OLS   Adj. R-squared:                  0.002 Method:                 Least Squares   F-statistic:                     30.31 Date:                Wed, 24 Jan 2018   Prob (F-statistic):           7.08e-14 Time:                        16:13:10   Log-Likelihood:                -16998. No. Observations:               30165   AIC:                         3.400e+04 Df Residuals:                   30162   BIC:                         3.403e+04 Df Model:                           2                                         Covariance Type:            nonrobust                                         ===================================================================================                      coef    std err          t      P>|t|      [0.025      0.975] ----------------------------------------------------------------------------------- Intercept           0.3597      0.003    128.380      0.000       0.354       0.365 C(cluster)[T.1]    -0.0190      0.006     -3.186      0.001      -0.031      -0.007 C(cluster)[T.2]    -0.1266      0.017     -7.326      0.000      -0.160      -0.093 ============================================================================== Omnibus:                     1526.319   Durbin-Watson:                   2.009 Prob(Omnibus):                  0.000   Jarque-Bera (JB):             4085.850 Skew:                           0.594   Prob(JB):                         0.00 Kurtosis:                       1.643   Cond. No.                         7.26 ==============================================================================
Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. means for tobacco use status by cluster           SMOKER cluster           0        0.359674 1        0.340672 2        0.233119 standard deviations for tobacco use status by cluster           SMOKER cluster           0        0.438970 1        0.376090 2        0.386357 Multiple Comparison of Means - Tukey HSD,FWER=0.05 ============================================= group1 group2 meandiff  lower   upper  reject ---------------------------------------------  0      1     -0.019   -0.033  -0.005  True  0      2    -0.1266   -0.167 -0.0861  True  1      2    -0.1076  -0.1494 -0.0657  True ---------------------------------------------
Code:
# -*- coding: utf-8 -*- """ Created on Mon Jan 18 19:51:29 2016
@author: jrose01 """
from pandas import Series, DataFrame import pandas as pd import numpy as np import matplotlib.pylab as plt from sklearn.cross_validation import train_test_split from sklearn import preprocessing from sklearn.cluster import KMeans
""" Data Management """ data = pd.read_csv("tree_addhealth.csv")
#upper-case all DataFrame column names data.columns = map(str.upper, data.columns)
# Data Management
data_clean = data.dropna()
# subset clustering variables cluster=data_clean[['ALCEVR1','MAREVER1','ALCPROBS1','DEVIANT1','VIOL1', 'DEP1','ESTEEM1','SCHCONN1','PARACTV', 'PARPRES','FAMCONCT']] cluster.describe()
# standardize clustering variables to have mean=0 and sd=1 clustervar=cluster.copy() clustervar['ALCEVR1']=preprocessing.scale(clustervar['ALCEVR1'].astype('float64')) clustervar['ALCPROBS1']=preprocessing.scale(clustervar['ALCPROBS1'].astype('float64')) clustervar['MAREVER1']=preprocessing.scale(clustervar['MAREVER1'].astype('float64')) clustervar['DEP1']=preprocessing.scale(clustervar['DEP1'].astype('float64')) clustervar['ESTEEM1']=preprocessing.scale(clustervar['ESTEEM1'].astype('float64')) clustervar['VIOL1']=preprocessing.scale(clustervar['VIOL1'].astype('float64')) clustervar['DEVIANT1']=preprocessing.scale(clustervar['DEVIANT1'].astype('float64')) clustervar['FAMCONCT']=preprocessing.scale(clustervar['FAMCONCT'].astype('float64')) clustervar['SCHCONN1']=preprocessing.scale(clustervar['SCHCONN1'].astype('float64')) clustervar['PARACTV']=preprocessing.scale(clustervar['PARACTV'].astype('float64')) clustervar['PARPRES']=preprocessing.scale(clustervar['PARPRES'].astype('float64'))
# split data into train and test sets clus_train, clus_test = train_test_split(clustervar, test_size=.3, random_state=123)
# k-means cluster analysis for 1-9 clusters                                                           from scipy.spatial.distance import cdist clusters=range(1,10) meandist=[]
for k in clusters:    model=KMeans(n_clusters=k)    model.fit(clus_train)    clusassign=model.predict(clus_train)    meandist.append(sum(np.min(cdist(clus_train, model.cluster_centers_, 'euclidean'), axis=1))    / clus_train.shape[0])
""" Plot average distance from observations from the cluster centroid to use the Elbow Method to identify number of clusters to choose """
plt.plot(clusters, meandist) plt.xlabel('Number of clusters') plt.ylabel('Average distance') plt.title('Selecting k with the Elbow Method')
# Interpret 3 cluster solution model3=KMeans(n_clusters=3) model3.fit(clus_train) clusassign=model3.predict(clus_train) # plot clusters
from sklearn.decomposition import PCA pca_2 = PCA(2) plot_columns = pca_2.fit_transform(clus_train) plt.scatter(x=plot_columns[:,0], y=plot_columns[:,1], c=model3.labels_,) plt.xlabel('Canonical variable 1') plt.ylabel('Canonical variable 2') plt.title('Scatterplot of Canonical Variables for 3 Clusters') plt.show()
""" BEGIN multiple steps to merge cluster assignment with clustering variables to examine cluster variable means by cluster """ # create a unique identifier variable from the index for the # cluster training data to merge with the cluster assignment variable clus_train.reset_index(level=0, inplace=True) # create a list that has the new index variable cluslist=list(clus_train['index']) # create a list of cluster assignments labels=list(model3.labels_) # combine index variable list with cluster assignment list into a dictionary newlist=dict(zip(cluslist, labels)) newlist # convert newlist dictionary to a dataframe newclus=DataFrame.from_dict(newlist, orient='index') newclus # rename the cluster assignment column newclus.columns = ['cluster']
# now do the same for the cluster assignment variable # create a unique identifier variable from the index for the # cluster assignment dataframe # to merge with cluster training data newclus.reset_index(level=0, inplace=True) # merge the cluster assignment dataframe with the cluster training variable dataframe # by the index variable merged_train=pd.merge(clus_train, newclus, on='index') merged_train.head(n=100) # cluster frequencies merged_train.cluster.value_counts()
""" END multiple steps to merge cluster assignment with clustering variables to examine cluster variable means by cluster """
# FINALLY calculate clustering variable means by cluster clustergrp = merged_train.groupby('cluster').mean() print ("Clustering variable means by cluster") print(clustergrp)
# validate clusters in training data by examining cluster differences in GPA using ANOVA # first have to merge GPA with clustering variables and cluster assignment data gpa_data=data_clean['GPA1'] # split GPA data into train and test sets gpa_train, gpa_test = train_test_split(gpa_data, test_size=.3, random_state=123) gpa_train1=pd.DataFrame(gpa_train) gpa_train1.reset_index(level=0, inplace=True) merged_train_all=pd.merge(gpa_train1, merged_train, on='index') sub1 = merged_train_all[['GPA1', 'cluster']].dropna()
import statsmodels.formula.api as smf import statsmodels.stats.multicomp as multi
gpamod = smf.ols(formula='GPA1 ~ C(cluster)', data=sub1).fit() print (gpamod.summary())
print ('means for GPA by cluster') m1= sub1.groupby('cluster').mean() print (m1)
print ('standard deviations for GPA by cluster') m2= sub1.groupby('cluster').std() print (m2)
mc1 = multi.MultiComparison(sub1['GPA1'], sub1['cluster']) res1 = mc1.tukeyhsd() print(res1.summary())
0 notes
dmitrybachindata-blog · 7 years ago
Text
LASSO regression
Cause I am really interested how different is resulted recieved by various methods, my goal is the same: find out what kind of people tend to be smokers.
To achieve the goal I have chosen 11 variables, which is listed in the table below. The variables are both quantitative and categorical. Also education is a categorical variable, but I have decided to consider it like a quantitative because it is ranged by different level of education.
Tumblr media
The training set included 70% of the participants (N=29150) and a validation set contained 30% of them (N=12390). In order to estimate the lasso regression model, the k=10 fold cross validation algorithm was used.
Tumblr media
Figure 1. Change in the validation mean square error at each step
I have used 11 putative explanatory variables. Concluding the model from the LASSO regression, 10 of them retained in the selected model. The only variable which has not been considered in the model is body mass index.
As we can see in the figure 1 the alcohol consuming (CONSUMER), major depression(MAJORDEPLIFE) and sex are most strongly associated with smoking (the association with SEX means that males tend to smoke more than females). The strongest negative association with smoking is education(S1Q6A) and total household income in last 12 months(S1Q12A). Other variables had tiny associations and all of them (except coverage by Medicare) had a positive association with smoking. 
However, R-square was only 0.1025, so this model can be explanatory only for 10.25% of the variance of response variable(tobacco use status),
Output:
Tumblr media Tumblr media Tumblr media Tumblr media Tumblr media Tumblr media Tumblr media
Code:
data ; set new; LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;
DATA new; set mydata.nesarc_pds;
/*Counting Body Mass Index*/       BMI = 703*(S1Q24LB)/((12*S1Q24FT+S1Q24IN)**2);
LABEL SMOKER='Tobacco use status'      S7Q1='Ever had strong fear or avoidance of social situation'      S1Q6A ='Education'      BMI = 'Body mass index'      MAJORDEPLIFE = 'Major depression - lifetime'      S1Q12A = 'Total household income in last 12 months'      AGE ='AGE'      SEX ='SEX'      S1Q14C1 = 'Currently covered by medicare'      S1Q14C2 = 'Currently covered by medicaid'      CONSUMER = 'Drinking status'      S4AQ1 ='Ever had 2-week period when felt sad, blue, depressed, or down most of time';
/*setting 0 for lifetime non-smoker, 0.5 - for ex-smoker, 1 for current smoker*/ if SMOKER=2 then SMOKER=0.5; if SMOKER=3 then SMOKER=0;
/*adjusting drinking status for 0 - lifetime abstainer, 0.5 ex-drinker, 1 - current drinker*/ if CONSUMER=3 then CONSUMER=0; if CONSUMER=2 then CONSUMER=0.5;
/*adjusting sex for male - 1 and 0 for female*/ if SEX=2 then SEX=0;
/*making 0 for NO instead of 2*/ if S1Q14C1=2 then S1Q14C1=0; if S1Q14C2=2 then S1Q14C2=0; if S7Q1=2 then S7Q1=0; if S4AQ1=2 then S4AQ1=0;
/*excluding unknown results*/ if S7Q1=9 then S7Q1=.; if S4AQ1= 9 then S4AQ1=.;
ods graphics on;
proc surveyselect data=new out=traintest seed = 123 samprate=0.7 method=srs outall; run;
proc glmselect data=traintest plots=all seed=123;     partition fraction(validate=0.3);     model SMOKER = S7Q1 S4AQ1 SEX S1Q14C1 S1Q14C2 CONSUMER AGE S1Q12A S1Q6A     BMI MAJORDEPLIFE/selection=lar(choose=cv stop=none) cvmethod=random(10); run; quit;
0 notes
dmitrybachindata-blog · 7 years ago
Text
The random forest
The goal of the random forest was to find what kind of people tend to smoke. In the decision tree, I have found that males elder 42.6, consuming alcohol, tend to smoke and any other categories tended not to smoke. However, the random forest has shown a little bit different result.
For Random Forest I have chosen the following variables: Tobacco use status, ‘Ever had strong fear or avoidance of social situation', 'Education' (can be considered as a quantitative), 'Nicotin dependence - lifetime', 'Body mass index', 'Major depression - lifetime', 'Total household income in last 12 months', age, sex, coverage by medicare and medicaid, 'Drinking status', 'Ever had 2-week period when felt sad, blue, depressed, or down most of time'.
The highest relative importance was shown by alcohol drinking status, sex, 'Ever had 2-week period when felt sad, blue, depressed, or down most of time', and coverage by Medicare. 
The misclassification rate was 0.445 so only 55.5% of sample date were correctly classified. Misclassification rate (OOB) is fluctuating from 0.373 to 0.339 during the 100 trees, and the average of this borders (0.356) was achieved by adding tenth tree. So after this tree fluctuations have become significantly lower.
Output:
Tumblr media Tumblr media Tumblr media Tumblr media Tumblr media
Code:
data ; set new; LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;
DATA new; set mydata.nesarc_pds;
/*Counting Body Mass Index*/       BMI = 703*(S1Q24LB)/((12*S1Q24FT+S1Q24IN)**2);
LABEL SMOKER='Tobacco use status'      S7Q1='Ever had strong fear or avoidance of social situation'      S1Q6A ='Education'      TABLIFEDX ='Nicotin dependence - lifetime'      BMI = 'Body mass index'      MAJORDEPLIFE = 'Major depression - lifetime'      S1Q12A = 'Total household income in last 12 months'      AGE ='AGE'      SEX ='SEX'      S1Q14C1 = 'Currently covered by medicare'      S1Q14C2 = 'Currently covered by medicaid'      CONSUMER = 'Drinking status'      S4AQ1 ='Ever had 2-week period when felt sad, blue, depressed, or down most of time';
/*setting 2 for lifetime non-smoker and 1 for current smoker*/ if SMOKER=2 then SMOKER=1; if SMOKER=3 then SMOKER=2;
if TABLIFEDX=0 then TABLIFEDX=2;
/*excluding unknown results*/ if S7Q1=9 then S7Q1=.; if S4AQ1= 9 then S4AQ1=.; /* making 0 for NO and 1 for YES */
proc sort; by IDNUM;
proc hpforest; target SMOKER/level=nominal; input S7Q1 S4AQ1 SEX S1Q14C1 S1Q14C2 CONSUMER/level=nominal; input AGE S1Q12A S1Q6A BMI/level=interval;
run; quit;
0 notes
dmitrybachindata-blog · 8 years ago
Text
Decision tree
My tries to find some association between tobacco use status and lifetime social avoidance cases was totally unsuccessful when I have tried to use tree decision. I have decided that it doesn’t make sense to work with 98-99% error rate. That is why I decided to look for some connections with smoking among adult people. 
Cause I am looking which group of people tend to smoke, I have decided to collapse current and ex-smokers into the same group. So response categorical variable is called SMOKER with 1 for current or ex-smoker and 2 - lifetime non-smoker.
The parameters of decision tree operations were to use the entropy algorithm for growing tree and cost-complexity algorithm for pruning tree in the subtree which we can see in the picture below.
All explanatory variables are on the table below: 
Tumblr media
For the decision tree was used 41510 participants of 43093.
As a result, I have received the following tree:
Tumblr media
The first separation of the sample was using the drinking status variable. The sample was divided into two subgroups: 1)lifetime abstainers, 2)current and ex-drinkers. The lifetime abstainers were much likely (83.13%) not to smoke. Current and ex-alcohol consumers had a little bigger trend (51.31%) to be smokers(or ex-smokers) than people who have never smoked.
Further, in both cases, the subgroups were divided by age. Among abstainers people younger than 42.8 were more likely (89.98%) to be lifetime nonsmokers. The elder or equal 42.8 years old lifetime abstainers were also tending to be nonsmokers but with a lower score (78.01%). Among elder lifetime abstainers men had a lower score (66.86%) to be nonsmoker then women (81.36%).
Considering people who have ever drunk alcohol they were divided by age as well. However, the result is not so different as it was in abstainers subgroup: both younger and elder(or equal) than 42.8 were tend to be nonsmokers with a score 56.67% and 58.8% respectively. An interesting division was observed in the subgroup of younger alcohol (ex-)consumers. The division was resulted by education and people who completed a bachelor degree or higher had the higher trend (69,12%) to be nonsmoker then people with a lower educational degree(52,47% to be nonsmoker).
 Finally, we are considering the subgroup of alcohol consumers or ex-consumers who were elder then 42.8 years old or equal. This subgroup was divided by sex. Women had rate 50.38% to be nonsmokers and men 68.1% to be current or ex-smokers. 
So we can see that the only one terminal node where the tendency to be nonsmoker is absent is alcohol drinking men who are elder then 42.8 years or equal. And need to mention that for women from the same subgroup the rate for smokers(or ex) and nonsmokers are not so different (approximately 50%). 
The highest trend for nonsmoking (lifetime) have shown lifetime abstainers younger than 42.8 years (89.98%).
Looking at the model-based confused matrix I can say that 63.41%(sensitivity) of lifetime smokers and 70.17%(specificity) of lifetime nonsmokers were classified correctly.
0 notes
dmitrybachindata-blog · 8 years ago
Text
Logistic regression analysis
First of all, I was a little bit frustrated with the lecture video. Because for most assignments (except a couple of) I have tried to find an association between social avoidance or social phobia and smoking. However, my analysis was different because I have used social characteristics as response variables and smoking as explanatory. So in this assignments, I am trying to find some potential reasons for social avoidance (using the variable "ever had a strong fear of avoidance of social situation"). 
As a potential confounder, I have chosen major depression and panic disorder. After adjusting the odds of having avoidance in a lifetime was 16,5% higher for participants who smoke then who have never smoked (OR=1.165, 95% CI =1.076-1.261, p= 0.0002). For panic disorder odds ratio was higher (OR= 1.596, 95% CI = 1.384-1.839 , p<0.0001)  and for major depression is even much higher (OR= 3.676, 95% CI = 3.392-3.985 , p<0.0001). Also, I need to mention that all three 95% confidence intervals are not crossing with each other.
Before adding confounder I have checked the logistic regression only with tobacco use status. The result was higher than after adjusting (OR=1.377, 95% CI=1.275-1.487, p<0.0001). Though the odds ratio have been increased after adjusting, the association are still can be observed. However, tobacco use status doesn’t look so serious now comparing with panic disorder or depression.
Output:
Tumblr media Tumblr media
Code:
data ; set new;LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;
DATA new; set mydata.nesarc_pds;
LABEL SMOKER='Tobacco use status'      S7Q1='Ever had strong fear or avoidance of social situation'
/*making 1 for current smoker 0 for lifetime non-smoker and excluding ex-smokers*/ if SMOKER=2 then SMOKER=.; if SMOKER=3 then SMOKER=0;
/*excluding unknown results*/ if S7Q1=9 then S7Q1=.;
/* making 0 as abscence for all variables*/ if S7Q1=2 then S7Q1=0;
proc sort; by IDNUM;
proc logistic descending; model S7Q1=SMOKER MAJORDEPLIFE PANLIFE;
run; quit;
0 notes
dmitrybachindata-blog · 8 years ago
Text
Multiple regression analysis
Hello! This is the one post for every answers so do not close it until you do not finish the peer review.
NERSAC dataset mostly includes categorical variable. So I have decided to try GAPMINDER dataset while we need to use quantitative variables.
Cause it is not my major dataset I have decided to make something a little bit strange. So I have decided to use Democracy score as a response variable. For explanatory variables, I have chosen alcohol consumption and was a little bit shocked that R-square, in this case, is 0,1857. So 18,57% of political score variance, probably, can be associated with alcohol consumption. And political score (range from -10 to 10) has a positive association with alcohol consumption. p-value<0.0001. 
Tumblr media
However, above was a short version of previous assignment and now we need more variables to check the hypothesis of a positive association between political score and alcohol consumption. I have decided to add some variable, but considering a statistically significant (p>0.05) the oil consumption and income rate. The quadratic oil consumption and income level made a model with better R-square.
As a consequence, the R-square is 0.513. Coefficients and p-value you can see below. However, the result of alcohol consumption is not so good now.
Tumblr media
All explanatory variables which were significant (p<0.05) associated with the response variable are in the table above. Though both coefficients of income are pretty small, it plays a significant role in this model ( R-square = 0,24 without income rate). Also, I have tested some other parameters as a putative confounder (like employee rate, internet using rate and urbanization etc) but p-values were higher than 0.05 (in case of internet use rate p-value was like 0.57 or for suicide level it was 0.68) 
Now we can see that alcohol consumption is not statistically significant(p-value 0.1288). Moreover, after excluding alcohol consumption, the results show that R-square is equal 0.491 (against 0.513 with it) and other coefficients are still significant. So this is a final regression model.
Tumblr media
So, the hypothesis about an association between alcohol consumption and a political score is not confirmed. And the first result was confounded.
This is QQ-plot(below) for the political score. We can see that is some declines in this model because residuals are close to normal distribution except the high and low quantiles. 
Tumblr media
Considering standardized residuals(the plot is below), the majority of them belong to the space between 2 and -2, and 3 countries(5%) are beyond that. So 95% of data is between 2 and -2 which looks like an acceptable model.
Tumblr media
Talking about leverage plot(below). The good news is that we do not have any points which are both outliers and leverages. And we can see here three outliers. So we can say that all three outliers don’t make any influence on my regression model.
Tumblr media
To conclude, the hypothesis of an association between alcohol consumption and a political score is not confirmed. However, during the looking for confounders, I have created a regression model with R^2=0.49 where political score is response variable and income rate and oil consumption rate (both linear and quadratic) as explanatory variables. Considering plots we can say the is pretty acceptable regression model, however it was considered only for 60 countries (there is not enough data).
Output (without graphs, it is pretty difficult to put it there from SAS if you know how to make it easy (because there are a lot of them) please let me know):
Tumblr media Tumblr media
Code:
LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly; DATA new; set mydata.gapminder; /*Centralizing*/ AL_c = alcconsumption - 6.6894118; IN_c = incomeperperson - 8740.97; Oil_c =  oilperperson - 1.4840852; LABEL AL_c='Alcohol consumption' polityscore = 'Political situation' IN_C = 'income' Oil_c = 'Oil consumption'; proc glm PLOTS(unpack)=all; model polityscore=Al_c IN_c IN_c*IN_c Oil_c Oil_c*Oil_c/solution clparm; output residual=res student=stdres out=results; run; proc gplot; label stdres='Standartized residual' country='Country'; plot stdres*country/vref=0; run; quit;
0 notes
dmitrybachindata-blog · 8 years ago
Text
Linear regression analysis
Previously, my explanatory variable was “tobacco use status” with 0 for lifetime non-smokers, 1 for ex-smokers and 2 for current smokers.
For this assignment, I have decided to exclude ex-smoker from the analysis. And make category 1 for smokers. The reason for excluding ex-smokers is that no significant difference between ex-smokers and other categories was revealed during some posthoc tests.
As for response variable (which was categorical too previously), I needed to find something not connected to social fear. The reason for that is because there is no adequate quantitative variable in the social phobia section. As I decided to check popular statement about smoking, I have decided to check one more. 
This is about that smoking helps you to keep your weight low. The weight depends on height too, so I decided to count a body mass index for every person.  So BMI will be response variable. 
Code:
LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;
DATA new; set mydata.nesarc_pds;
LABEL SMOKER='Tobacco use status'  S1Q24LB = 'Weight in pounds'      S1Q24FT = 'Height in feets'      S1Q24IN = 'Height in inches';
/*excluding ex-smokers and make 0 for non-smokers and 1 for smokers*/ if SMOKER=3 then SMOKER=0; else if smoker = 2 then smoker =.;
/*excluding unknown results*/ If S1Q24FT=99 then S1Q24FT=.; if S1Q24IN=99 then S1Q24IN=.;
/*Counting Body Mass Index*/       BMI = 703*(S1Q24LB)/((12*S1Q24FT+S1Q24IN)**2);
proc sort; by IDNUM;
proc glm PLOTS(MAXPOINTS=40000); model BMI=SMOKER/solution;
run;
Results:
Tumblr media Tumblr media
As we can see, p-value is low than 0.0001, so the result could be considered as significant. Intercept is 29,45 and coefficient is -1,29. So we can see that 29,45 is BMI for people who have never smoked and 28.15 for current smokers.However R-square is equal 0.001023 that means that smoker status explains only 0.1% of variance of BMI.
0 notes
dmitrybachindata-blog · 8 years ago
Text
Measures
As an explanatory variable, I have used a tobacco-user status(lifetime non-smokers, ex-smokers, and current smokers) and as the response variable, the answer to “ever had strong fear or avoidance of social situation”.
As management, I have excluded the people who were not sure if they had strong fear or avoidance of social situation. In future, I plan to check some lurking variables. I have already condensed the variable “family income” into three equal group and found out that it is not a confounder. In future, I plan to make a variable with different (from 4 to 6, haven’t decide yet) and some others.
0 notes
dmitrybachindata-blog · 8 years ago
Text
NESARC - procedure
The data collection was provided by trained U.S. Census Bureau Field Representatives during 2001 and 2002 through computer-assisted personal interviews. The purpose of the survey was to collect a huge set of data related to alcohol, drug, and tobacco use as well as five mood disorders.  In every household, one adult was selected and interviewed in own home with informed consent procedures.
0 notes
dmitrybachindata-blog · 8 years ago
Text
NESARC - sample
The sample volume is 43,093 individuals living in the USA. Almost all of them were from 18 to 97 years old except 14 people who were 98 or elder. The significant part of participants (n=16156) was from the South region. Others were from Midwest, Northwest and West regions (n=8991, 8209 and  9737 respectively). Also, participants represented different occupations(huge variety of job including military, managers, administrative, handlers etc.), levels of education (from people who had no formal schooling to people who had a master’s degree or higher)  and personal income level (from 0 up to 3000000$ in last 12 months).  The survey was conducted in 2001 and 2002 by the National Epidemiologic Survey on Alcohol and Related Conditions (NESARC).
In my observation, I have tried to find the association between tobacco-user status and experience of avoidance of social situation. That is why 1455 participants. were excluded because there weren’t sure if they ever had strong fear or avoidance of social situation.
0 notes
dmitrybachindata-blog · 8 years ago
Text
Moderation
Actually, first time I had no idea which variable can be used as a third variable. First I have tried some random. After that, I have found that the association of tobacco use status and facts of social phobia cases doesn’t depend on family income, but depend on occupation type. Unfortunately, it was not so easy to condense levels of occupation, so I have decided to use the variable related to depression, so I have used the same moderate which was used in video example.
Code:
LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly; DATA new; set mydata.nesarc_pds; LABEL SMOKER='Tobacco use status'     S7Q1='Ever had strong fear or avoidance of social situation'     MAJORDEPLIFE= 'Major depression lifetime'; /*Some modification of smoker: 0 non-smoker, 1 ex-smoker, 2-smoker*/ IF SMOKER = 3 then SMOKER =0; ELSE IF SMOKER = 2 then SMOKER =.; ELSE IF SMOKER = 1 then SMOKER =2; IF SMOKER =. then SMOKER=1; /*18 years old or elder people*/ IF AGE>=18; /*removing unknown data*/ IF S7Q1 = 9 THEN S7Q1=.; If S7Q6 = 9 THEN S7Q6=.; IF S7Q1 = 2 THEN S7Q1 = 0; /*Class relation 1 for low, 2 for middle and 3 for high*/ PROC SORT; by IDNUM; proc gchart; vbar SMOKER/discrete type=mean SUMVAR=S7Q1; PROC Freq; Tables S7Q1*SMOKER/CHISQ; proc sort; by MAJORDEPLIFE; PROC Freq; Tables S7Q1*SMOKER/CHISQ; by MAJORDEPLIFE; run; data Comparison1; set new; if SMOKER=0 or SMOKER=1; PROC SORT; by MAJORDEPLIFE; PROC Freq; Tables S7Q1*SMOKER/CHISQ; by MAJORDEPLIFE; RUN; data Comparison2; set new; if SMOKER=0 or SMOKER=2; PROC SORT; by MAJORDEPLIFE; PROC Freq; Tables S7Q1*SMOKER/CHISQ; by MAJORDEPLIFE; RUN; data Comparison3; set new; if SMOKER=1 or SMOKER=2; PROC SORT; by MAJORDEPLIFE; PROC Freq; Tables S7Q1*SMOKER/CHISQ; by MAJORDEPLIFE; RUN;
Tumblr media Tumblr media
As we can see that p-value less than 0.05 (<0.001 and 0.0401 for absence and existence of depression experience) in both cases and tendency is still the same: among people who ever had social phobia cases current smokers(category 2) are on the first place and lifetime non-smokers(category 0) are on the last and ex-smokers(1) in the middle
Tumblr media Tumblr media Tumblr media Tumblr media Tumblr media Tumblr media
As for posthoc test (3 comparisons so p < 0.017), we can see that there is no significant difference between lifetime non-smokers and ex-smokers in case if there was major depression and there is no any significant difference between ex-smokers and current smokers. In All other situations, the difference is significant/
So, though people who had major depression tend to have social phobia cases as well, there is still the association between social phobia cases and tobacco user status, but only if we compare lifetime non-smokers and ex-smokers
0 notes
dmitrybachindata-blog · 8 years ago
Text
Pearson correlation
Unfortunately, I don’t have a lot quantitative variables in my topic. So I decided to look for correlation between some variables. To achieve it I have made some strange variables: weight, number of children and how many times people drink beer per month.
So...
Code:
LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly; DATA new; set mydata.nesarc_pds; LABEL WEIGHT = 'Final weight'  CHLD0_17 ='Number of children under age 18 in household'  S2AQ1 = 'Drank at least 1 alcoholic drink in life'  S2AQ5A = 'Drank any beer in last 12 month'  S2AQ5B = 'HOW OFTEN DRANK BEER IN LAST 12 MONTHS';   IF AGE>=18; /*beer drinking times per year*/ If S2AQ1=2 OR S2AQ5A=2  then BEERFREQ = 0; else if S2AQ5B=1 then BEERFREQ = 365; else if S2AQ5B=2 then BEERFREQ = 23.6*12; else if S2AQ5B=3 then BEERFREQ = 15*12; else if S2AQ5B=4 then BEERFREQ = 8.6*12; else if S2AQ5B=5 then BEERFREQ = 4.3*12; else if S2AQ5B=6 then BEERFREQ = 2.5*12; else if S2AQ5B=7 then BEERFREQ = 12; else if S2AQ5B=8 then BEERFREQ = 9; else if S2AQ5B=9 then BEERFREQ = 4.5; else if S2AQ5B=10 then BEERFREQ = 1.5; PROC SORT; by IDNUM; PROC GPLOT; Plot WEIGHT*CHLD0_17; PROC GPLOT; Plot WEIGHT*BEERFREQ; PROC GPLOT; Plot BEERFREQ*CHLD0_17; proc corr; VAR CHLD0_17 WEIGHT BEERFREQ; run;
Result:
Tumblr media
As we can see, all coefficients are close to 0. Even the largest about beer frequency and weight r=0.04946, so r^2=0.00245, so 0,245% of variance could be explained by this association. p-value<0.0001 in all cases. So we could be pretty sure about absence of linear dependence between variables.
About scatterplots. I am not showing bothscatterplots with beer frequency(they are completely random), but I have decided to show the one about weight and number of children, because it doens’t look so random. On the first glance, it looks that people who have less children tend to be more heavy.  However, it is not clear and need to be analized more accurately. 
Tumblr media
0 notes
dmitrybachindata-blog · 8 years ago
Text
Social phobia cases and tobacco use status.
Code:
LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;
DATA new; set mydata.nesarc_pds;
LABEL SMOKER='Tobacco use status'      S7Q1='Ever had strong fear or avoidance of social situation'  S7Q6='Usually became upset/anxious when had to be in these social situation'  S7Q17='Number of episodes'  S3AQ3B1='Usual frequency when smoked cigarettes';
/*Some modification of smoker: 0 non-smoker, 1 ex-smoker, 2-smoker*/ IF SMOKER = 3 then SMOKER =0; ELSE IF SMOKER = 2 then SMOKER =.; ELSE IF SMOKER = 1 then SMOKER =2; IF SMOKER =. then SMOKER=1;
/*18 years old or elder people*/ IF AGE>=18;
/*removing unknown data*/ IF S7Q1 = 9 THEN S7Q1=.; If S7Q6 = 9 THEN S7Q6=.; IF S7Q1 = 2 THEN S7Q1 = 0;
/*S7Q17C NUMBER OF EPISODES*/ IF S7Q17C=. and S7Q1 = 0 THEN S7Q17C=0; IF S7Q17C=99 then S7Q17C=.;
PROC SORT; by IDNUM;
PROC Freq; Tables S7Q1*SMOKER/CHISQ; run;
data Comparison1; set new; if SMOKER=0 or SMOKER=1; PROC SORT; by IDNUM; PROC Freq; Tables S7Q1*SMOKER/CHISQ; RUN;
data Comparison2; set new; if SMOKER=0 or SMOKER=2; PROC SORT; by IDNUM; PROC Freq; Tables S7Q1*SMOKER/CHISQ; RUN;
data Comparison3; set new; if SMOKER=1 or SMOKER=2; PROC SORT; by IDNUM; PROC Freq; Tables S7Q1*SMOKER/CHISQ; RUN;
Results:
Tumblr media
The responsive variable ‘ever had strong fear or avoidance of social situation’ has only two levels 0 for ‘No’ and 1 for ‘Yes’. As for explanatory variable, which is tobacco use status, it has 3 levels, where 0 is lifetime nonsmoker, 1 is ex-smoker and 2 is current smoker.
Among people who had social phobia cases there is non-smoker is the minor group(8,17%) comparing with ex-smokers (10,18%)  and current smokers(10,91%). X2=75.4140, 2 df, p<0.0001
So we could reject the null hypothesis, however, we need to check is there is a significant difference. Because of we have only three category, we have to do only three comparison. So after Bonferroni adjustment p must be less then 0.017.(dF=1)
Tumblr media Tumblr media Tumblr media
As we can see, that the part of lifetime nonsmokers is definitely different from ex- and current smokers (p<0.0001 in both comparisons). However we cannot be certain about the difference between ex- and current smokers because p = 0,1097.(dF=1)
0 notes
dmitrybachindata-blog · 8 years ago
Text
The number of social phobia cases and tobacco user status
Code:
LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;
DATA new; set mydata.nesarc_pds;
LABEL SMOKER='Tobacco use status'      S7Q1='Ever had strong fear or avoidance of social situation'  S7Q6='Usually became upset/anxious when had to be in these social situation'  S7Q17='Number of episodes'  S3AQ3B1='Usual frequency when smoked cigarettes';
/*Some modification of smoker: 0 non-smoker, 1 ex-smoker, 2-smoker*/ IF SMOKER = 3 then SMOKER =0; ELSE IF SMOKER = 2 then SMOKER =.; ELSE IF SMOKER = 1 then SMOKER =2; IF SMOKER =. then SMOKER=1;
/*18 years old or elder people*/ IF AGE>=18;
/*removing unknown data*/ IF S7Q1 = 9 THEN S7Q1=.; If S7Q6 = 9 THEN S7Q6=.; IF S7Q1 = 2 THEN S7Q1 = 0;
/*S7Q17C NUMBER OF EPISODES*/ IF S7Q17C=. and S7Q1 = 0 THEN S7Q17C=0; IF S7Q17C=99 then S7Q17C=.;
PROC SORT; by IDNUM;
proc anova; class smoker; model S7Q17C=smoker; means smoker;
proc anova; class smoker; model S7Q17C=smoker; means smoker/duncan; run;
ANOVA results:
Tumblr media
As usually I am using variables from tobacco section as a explanatory variable and ones from social phobia section as a response variables. 
Unfortunately there is not many quantitative variables in social phobia section. So I have decided to choose the number of cases of strong fear or avoidance of social situation and as the categorical number I have chosen tobacco user status with three levels: 0 for lifetime non-smoker, 1 for ex-smoker and 2 for current smoker.  
Actually, the number of cases is not the best variable to use this kind of analysis, because 88% of answers is 0, but it is still OK to use it as a demonstration for ANOVA.
H0 hypothesis is there is no difference in numbers of social phobia cases between any kind of tobacco user status.
For lifetime nonsmokers the result is lowest (Mean=0.48317948 and s.d. ± 4.34385228) for ex-smoker is the middle (Mean=0.59087353, s.d. ± 4.78980751) and for current smokers is the highest (Mean=0.65731881, s.d. ± 5.34404294). The result of test is F(2, 40757) = 5.3,  p = 0,005
post hoc ANOVA results:
Tumblr media
As for pair comparing of 3 categories, it has showed, that there is a significant difference between lifetime non-smoker and current smokers. However, the difference between ex-smokers is not significant comparing with any other group. 
0 notes
dmitrybachindata-blog · 8 years ago
Text
OK, does tobacco help with social problems?
ode:
LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly; DATA new; set mydata.nesarc_pds; LABEL SMOKER='Tobacco use status'      S7Q1='Ever had strong fear or avoidance of social situation'  S7Q6='Usually became upset/anxious when had to be in these social situation'  S7Q17='Number of episodes'  S3AQ3B1='Usual frequency when smoked cigarettes'; /*Some modification of smoker: 0 non-smoker, 1 ex-smoker, 2-smoker*/ IF SMOKER = 3 then SMOKER =0; ELSE IF SMOKER = 2 then SMOKER =.; ELSE IF SMOKER = 1 then SMOKER =2; IF SMOKER =. then SMOKER=1; /*18 years old or elder people*/ IF AGE>=18; /*removing unknown data*/ IF S7Q1 = 9 THEN S7Q1=.; If S7Q6 = 9 THEN S7Q6=.; IF S7Q1 = 2 THEN S7Q1 = 0; /*S7Q17C NUMBER OF EPISODES*/ IF S7Q17C=. and S7Q1 = 0 THEN S7Q17C=0; IF S7Q17C=99 then S7Q17C=.; PROC SORT; by IDNUM; PROC GCHART; VBAR SMOKER/Discrete type=percent width=30; PROC GCHART; VBAR S7Q1/Discrete type=percent width=30; PROC GCHART; VBAR S7Q17C/ type=percent; PROC GCHART; VBAR SMOKER/Discrete Type=mean SUMVAR=S7Q1 width=11; PROC GCHART; VBAR SMOKER/Discrete Type=mean SUMVAR=S7Q17C width=11; PROC FREQ; TABLES S7Q1 S7Q17C SMOKER; RUN;
------------------------------------------------------------------------------------------------
After some looking on my variables, I have recognized that the best option for explanatory variable is SMOKER, about tobacco use status, because it allows to compare not only smoker(category 2) and lifetime non-smokers(category 0) but ex-smokers(category 1) too.
This is a univariate graph for this variable:
Tumblr media
The most frequent answer is 0 (never smoked) and it is more than 50%. So graph is not uniform. Cause there is only 3 categories it is not a good idea to discuss about graph form and 
The first explanatory variable was the  “Ever had strong fear or avoidance of social situation” This is also categorical variable which contains only two category 0 for ‘No’ and 1 for ‘Yes’
Tumblr media
As we can see 90% told they have never had social problems. On the other hand almost 10% is still large sample. The distribution of frequencies is definitely unequal.
The second explanatory variable is Number of episodes of fear or avoidance of social situation. Because of 90% said ‘No’ about this problem. I have changed a little bit this variable and add answer 0, where I have included everyone who answered ‘No’ on previous question. It is easy to understand that 0 is the mode in the graph of this variable. That is why I am show this graph without answer 0 in order to tell a little bit more about its form.
Tumblr media
So we can see that this variable has skewed-right distribution and if we imagine huge meaning of 0 point, this skewed distribution will be much stronger.
As a result, I have made 1 C-C and 1 C-Q graphs. First it is the quantity of people who had strong fear and avoidance of social situation depending on tobacco use status (0 - lifetime non-smoker, 1 - ex-smoker,  2-current smoker).
Tumblr media
As we can see, among people who ever had a strong fear or avoidance of social situation there are more current-smokers comparing with ex-smokers, and lifetime non-smokers are less frequent category. So we can conclude that there is a association between tobacco use status and social phobia episodes. 
To look on the picture more precisely, I have made a second graph where I decided to check association between number of episodes of  strong fear or avoidance of social situation and tobacco use status (0 - lifetime non-smoker, 1 - ex-smoker,  2-current smoker).
Tumblr media
Here we can see the same trend. The highest number of the episodes is among the current smokers and the lowest is about lifetime non-smokers.So there is an association between Number of the episodes of social problems and tobacco use status.
What is more important. Among the whole sample lifetime non-smokers are 55%, and among people with social problem this category is minor. So it is not just the same distribution. On the other hand it doesn’t mean the direct connection between this pairs of variables. Anyway, this data shows that popular myth about smoking and social situations can be doubted.
0 notes
dmitrybachindata-blog · 8 years ago
Text
Data management and some changes in variables
Code:
LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly; DATA new; set mydata.nesarc_pds; LABEL SMOKER='Tobacco use status'      S7Q1='Ever had strong fear or avoidance of social situation'  S7Q6='Usually became upset/anxious when had to be in these social situation'  S3AQ3B1='Usual frequency when smoked cigarettes'; /*18 years old or elder people*/ IF AGE>=18; /*Smoking every day or don't smoke (or less then 100 cigarettes in lifetime)*/ IF S3AQ3B1 = 1 THEN EVERYDAYORNOT=1; ELSE IF S3AQ3B1 NE 1 AND S3AQ1A =2 THEN EVERYDAYORNOT=2; ELSE EVERYDAYORNOT=.; /*removing unknown data*/ IF S7Q1 = 9 THEN S7Q1=.; If S7Q6 = 9 THEN S7Q6=.; PROC SORT; by IDNUM; PROC FREQ; TABLES SMOKER S7Q1 S7Q6 EVERYDAYORNOT; RUN;
Results:
Tumblr media
I was a little bit surprised that I accidentally made some data managements previous week. Anyway It was not deliberately and I was inspired by lectures to make some changes.
First of all, when one talks about smoking helping social life, cigarettes usually are considered. And if talk about smoking and society, in my opinion, we need to consider not nicotine dependence, but smoking cigarettes frequency. Looking through frequency distribution I have found that there two huge groups of people: Everyday smokers and no-smokers. The other groups are little comparing with this two. This is why I have created second variable EVERYDAYORNOT where 1(37.59%) means Everyday smokers, and category 2 (62.41%) is people who smoked less then 100 cigarettes in their lives. 
As for tobacco use status, I don’t consider any data managements now, but may be later. Of the total number ~ 55,46% was life-time nonsmokers(category 3). The second place majority (25,8%) falls into category 1(current users). Category 2 is ex-smokers have 18,74% of sample.
I think that variables connected with social problems don’t need to be managed except some trimming with “unknown answer”. 
Considering question “Ever had strong fear or avoidance of social situation” the majority is the answer ‘No’ (90,74%). Though the category 1 (Yes) is only 9,26%, it is still large sample volume (n=3854). Also there is Frequency missing is 1455.
The question “Usually became upset/anxious when had to be in these social situation” has a huge amount of frequency missing(37712). However the other part is still enough to make a little research(n = 5381) and the distribution is 58.74% and 41.26% for “yes” and “no” respectively.
0 notes
dmitrybachindata-blog · 8 years ago
Text
First steps and sample condensing
Code:
LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly; DATA new; set mydata.nesarc_pds; LABEL SMOKER='Tobacco use status'      S7Q1='Ever had strong fear or avoidance of social situation' S7Q6='Usually became upset/anxious when had to be in these social situation'; /*18 years old or elder people who was sure about past problems in social situations (or lack of problems)*/ IF S7Q6=2 or S7Q6=1; IF S7Q1=2 or S7Q1=1; IF AGE>=18; PROC SORT; by IDNUM; PROC FREQ; TABLES SMOKER S7Q1 S7Q6; RUN;
Results:
Tumblr media
First of all, I had to condense a sample in order to exclude people who was not sure about social problem experience (or lack or it). The reason for that is because in this case data could be unreliable and there would be a lot of missing points. Also I have decided to consider people was 18 or elder because if we talk about myth “Smoking improves social adaptation” we often consider adult people.
I have used three variables. First was about tobacco use status. Of the total number ~ 50,53% was life-time nonsmokers(category 3). The second place majority (29,36%) falls into category 1(current users). Category 2 is ex-smokers have 20,1% of sample.
Among the people who ever had strong fear or avoidance of social situation(second chosen variable) a significant part (71,51%) belongs to category 1(answer ‘yes’). Category 2 is answer ‘no’. As I mentioned before I have excluded people who wasn’t sure about the answer.
The third variable is “Usually became upset/anxious when had to be in these social situation. A little bit more than half (58,8%) of people fell in category 1(’yes’). The answer ‘No’ was category 2(41,2%).
To summarize, although I had to exclude a lot of respondents from the research, the sample volume is still large (and bigger than most research in this field). Every type of answer was given by a significant quantity of people, so I have a good chance to be close to general population.
0 notes