jprosas
Regression Modeling in practice
jprosas · 5 years ago
Regression Modeling in Practice Week 4
In this section we fitted a logistic regression model with “SOCPDLIFE”, “MAJORDEPLIFE” and “numbercigsmoked_c” as the explanatory variables and “NICOTINEDEP” as the response variable:
                    | Lower CI | Upper CI | OR   | p-value
Intercept           | 0.99     | 1.29     | 1.13 | 0.065
SOCPDLIFE           | 1.15     | 4.83     | 2.36 | 0.019
MAJORDEPLIFE        | 2.78     | 5.11     | 3.77 | <0.0001
numbercigsmoked_c   | 1.04     | 1.07     | 1.05 | <0.0001
As we can see, for our primary explanatory variable (MAJORDEPLIFE) the OR = 3.77 means that, holding the other variables constant, the odds of being nicotine dependent are 3.77 times higher for someone with major depression than for someone who doesn't have it. But let's keep in mind that the 95% confidence interval for this ratio goes from 2.78 to 5.11.
All the confounding variables have a p-value less than 0.05, so they are statistically significant.
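To make the link between the coefficients and the odds ratios explicit, here is a minimal sketch that redoes the arithmetic by hand with the values reported in the model summary. The helper names (`odds_ratio`, `probability`) are mine, not part of statsmodels, and the two example profiles (average cigarette consumption, no social phobia, with and without major depression) are only illustrative assumptions.

```python
import math

# Coefficients (log-odds) and 95% CI bounds copied from the Logit summary output.
coef = {
    "Intercept": 0.1233,
    "SOCPDLIFE": 0.8587,
    "MAJORDEPLIFE": 1.3281,
    "numbercigsmoked_c": 0.0511,
}
ci_majordep = (1.024, 1.632)


def odds_ratio(log_odds):
    """Exponentiate a logit coefficient to get an odds ratio."""
    return math.exp(log_odds)


def probability(log_odds):
    """Convert log-odds to a probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-log_odds))


or_majordep = odds_ratio(coef["MAJORDEPLIFE"])
or_low, or_high = (odds_ratio(bound) for bound in ci_majordep)
print(f"OR = {or_majordep:.2f}, 95% CI = ({or_low:.2f}, {or_high:.2f})")

# Hypothetical profiles: average number of cigarettes (numbercigsmoked_c = 0),
# no social phobia, without and with major depression.
p_without = probability(coef["Intercept"])
p_with = probability(coef["Intercept"] + coef["MAJORDEPLIFE"])
print(f"P(dependent | no depression) = {p_without:.2f}")
print(f"P(dependent | depression)    = {p_with:.2f}")
```

Note that the odds ratio compares odds, not probabilities, so the ratio of the two predicted probabilities is smaller than the odds ratio itself.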
INPUTS AND OUTPUTS
import numpy
import statsmodels.formula.api as smf

# center the quantitative explanatory variables on their means
sub1['age_c']=(sub1['AGE'] - sub1['AGE'].mean())
sub1['numbercigsmoked_c'] = (sub1['numbercigsmoked'] - sub1['numbercigsmoked'].mean())

# logistic regression with social phobia, depression and numbercigsmoked_c
logmod = smf.logit(formula = 'NICOTINEDEP ~ SOCPDLIFE + MAJORDEPLIFE + numbercigsmoked_c', data = sub1).fit()
print (logmod.summary())
#as we can see they all are significant variables
# odds ratios with 95% confidence intervals
params = logmod.params
conf = logmod.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print (numpy.exp(conf))

Optimization terminated successfully.
         Current function value: 0.613855
         Iterations 6
                           Logit Regression Results
==============================================================================
Dep. Variable:            NICOTINEDEP   No. Observations:                 1315
Model:                          Logit   Df Residuals:                     1311
Method:                           MLE   Df Model:                            3
Date:                Fri, 13 Nov 2020   Pseudo R-squ.:                 0.08399
Time:                        03:47:55   Log-Likelihood:                -807.22
converged:                       True   LL-Null:                       -881.23
Covariance Type:            nonrobust   LLR p-value:                 7.025e-32
=====================================================================================
                        coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept             0.1233      0.067      1.849      0.065      -0.007       0.254
SOCPDLIFE             0.8587      0.365      2.353      0.019       0.144       1.574
MAJORDEPLIFE          1.3281      0.155      8.560      0.000       1.024       1.632
numbercigsmoked_c     0.0511      0.008      6.305      0.000       0.035       0.067
=====================================================================================
                   Lower CI  Upper CI   OR
Intercept              0.99      1.29 1.13
SOCPDLIFE              1.15      4.83 2.36
MAJORDEPLIFE           2.78      5.11 3.77
numbercigsmoked_c      1.04      1.07 1.05
Regression Modeling in Practice Week 3
In this section we fitted a multiple linear regression model with “DYSLIFE”, “MAJORDEPLIFE”, “numbercigsmoked_c” and “age_c” as the explanatory variables and “NDSymptoms” as the response variable.
We obtained the following model:
                    | ESTIMATE | p-value
Intercept           | 2.1908   | 0.000
DYSLIFE             | 0.2686   | 0.197
MAJORDEPLIFE        | 1.2917   | 0.000
numbercigsmoked_c   | 0.0357   | 0.000
age_c               | -0.0410  | 0.062
We can see that age_c and DYSLIFE don't pass the significance test (p-value > 0.05) when the other variables are in the model.
Also, the R-squared is 0.136, which is not very good: the model explains only about 13.6% of the variability in NDSymptoms.
But, as we expected, MAJORDEPLIFE has a significant positive linear relationship with NDSymptoms.
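As a quick way to read the estimates, the fitted equation can be evaluated by hand. This is only a sketch using the rounded coefficients from the table above; the function name `predict_ndsymptoms` and the example profiles are hypothetical. Because `numbercigsmoked_c` and `age_c` are centered, a value of 0 stands for the sample mean.

```python
# Rounded estimates copied from the model table above.
COEF = {
    "Intercept": 2.1908,
    "DYSLIFE": 0.2686,
    "MAJORDEPLIFE": 1.2917,
    "numbercigsmoked_c": 0.0357,
    "age_c": -0.0410,
}


def predict_ndsymptoms(dyslife=0, majordeplife=0, numbercigsmoked_c=0.0, age_c=0.0):
    """Predicted NDSymptoms under the fitted linear model (hypothetical helper)."""
    return (
        COEF["Intercept"]
        + COEF["DYSLIFE"] * dyslife
        + COEF["MAJORDEPLIFE"] * majordeplife
        + COEF["numbercigsmoked_c"] * numbercigsmoked_c
        + COEF["age_c"] * age_c
    )


# Hypothetical profiles: average age and average cigarettes per day, no dysthymia.
baseline = predict_ndsymptoms()
with_depression = predict_ndsymptoms(majordeplife=1)
print(f"without major depression: {baseline:.4f}")
print(f"with major depression:    {with_depression:.4f}")
```

The difference between the two predictions is exactly the MAJORDEPLIFE estimate, which is what "holding the other variables constant" means in practice.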
Next we can see some graphs to analyze the residuals of our model:
* Q-Q plot
[Figure: Q-Q plot of the model residuals]
In this plot we can see that the residuals of our model don't exactly follow a normal distribution.
* standardized residuals for all observations
[Figure: standardized residuals by observation number]
Almost all of our observations have standardized residuals within an absolute value of 2.
* leverage
[Figure: influence plot of standardized residuals against leverage]
Here we can see that most of the observations whose residuals are not within an absolute value of 2 have really low leverage, and the two most influential points have residuals within an absolute value of 2, so we can't think of them as outliers.
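The visual check above can also be written as a small numeric check. The sketch below is not the code used for this post: the function name, the made-up residual and leverage values, and the 2p/n leverage cutoff (a common rule of thumb, with p the number of estimated parameters) are all assumptions for illustration.

```python
def summarize_residuals(std_resid, leverage, n_params, resid_cutoff=2.0):
    """Flag large standardized residuals and high-leverage observations.

    The leverage cutoff 2 * n_params / n is a rule of thumb, not a formal test.
    """
    n = len(std_resid)
    lev_cutoff = 2 * n_params / n
    large = [i for i, r in enumerate(std_resid) if abs(r) > resid_cutoff]
    high_lev = [i for i, h in enumerate(leverage) if h > lev_cutoff]
    return {
        "share_within_cutoff": (n - len(large)) / n,
        "large_residual": large,
        "high_leverage": high_lev,
        # observations that are both poorly fitted and influential
        "both": sorted(set(large) & set(high_lev)),
    }


# Hypothetical values for 10 observations from a model with 2 parameters.
std_resid = [0.3, -1.2, 2.5, 0.8, -0.4, 1.1, -2.3, 0.2, 1.9, -0.6]
leverage = [0.10, 0.15, 0.12, 0.55, 0.14, 0.18, 0.11, 0.45, 0.10, 0.10]

result = summarize_residuals(std_resid, leverage, n_params=2)
print(result)
```

In this made-up example no observation is flagged on both criteria, which mirrors the situation described above: the points with large residuals have low leverage, and the high-leverage points are fitted well.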
INPUTS
import pandas
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf

full_model = smf.ols('NDSymptoms ~ DYSLIFE + MAJORDEPLIFE + numbercigsmoked_c + age_c', data=sub1).fit()
print (full_model.summary())

# Q-Q plot of the residuals
fig1=sm.qqplot(full_model.resid,line='r')

# standardized residuals for all observations
stdres=pandas.DataFrame(full_model.resid_pearson)
fig2=plt.plot(stdres,'o',ls='None')
l=plt.axhline(y=0,color='r')
plt.ylabel('Standardized Residual')
plt.xlabel('Observation Number')

# leverage plot
fig3=sm.graphics.influence_plot(full_model,size=6)
Regression Modeling in Practice Week 2
In this section we fitted a linear model for the response variable “numbercigsmoked” using only the categorical explanatory variable “NICOTINEDEP”.
The results indicated that a person classified as nicotine dependent smokes, on average, 3.2139 more cigarettes per day than someone who is not classified as nicotine dependent.
So we got an estimate of 3.2139 with a p-value = .0001, so this is a significant relationship.
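With a single 0/1 explanatory variable, the slope of the linear model is simply the difference between the two group means, and the intercept is the mean of the group coded 0. The sketch below checks this on a tiny made-up data set (the numbers are hypothetical, not NESARC data) using the closed-form least squares formulas instead of statsmodels; `simple_ols` is a helper name I introduce here.

```python
from statistics import mean

# Hypothetical data: 0 = not nicotine dependent, 1 = nicotine dependent.
nicotinedep = [0, 0, 0, 1, 1, 1]
numbercigsmoked = [5, 10, 15, 10, 15, 20]


def simple_ols(x, y):
    """Return (intercept, slope) of the least squares line y = a + b * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


intercept, slope = simple_ols(nicotinedep, numbercigsmoked)

# Group means computed directly, for comparison with the regression estimates.
mean_0 = mean(c for d, c in zip(nicotinedep, numbercigsmoked) if d == 0)
mean_1 = mean(c for d, c in zip(nicotinedep, numbercigsmoked) if d == 1)

print("slope:", slope, "difference in means:", mean_1 - mean_0)
print("intercept:", intercept, "mean of group 0:", mean_0)
```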
We can also see in the following frequency table that there are more nicotine dependent people than people who are not:
NICOTINEDEP | count
0           | 517
1           | 798
Here's the code that was used for this section:
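import pandas
import statsmodels.formula.api as smf

sub1 = sub1[['numbercigsmoked', 'NICOTINEDEP']].dropna()

reg1 = smf.ols('numbercigsmoked ~ NICOTINEDEP', data=sub1).fit()
print (reg1.summary())

# frequency table of the explanatory variable
my_tab = pandas.crosstab(index=sub1['NICOTINEDEP'], columns="count")
print (my_tab)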
Understanding some relations between the use of drugs and mental health
SAMPLE
The sample consists of 43,049 observations, collected through a survey administered individually to the adult population of the United States.
The survey is the first wave of the National Epidemiologic Survey on Alcohol and Related Conditions (NESARC).
The data analytic sample for this study included participants 18-25 years old who reported smoking at least 1 cigarette per day in the past 30 days (N = 1,320).
PROCEDURE
Data were collected during 2001-2002 by trained U.S. Census Bureau Field Representatives, who interviewed each participant one by one.
MEASURES
To determine whether the person is currently smoking we used smoking frequency and quantity, so we know whether daily smoking is present and can quantify how many cigarettes per day the person smokes (ranging from 1 to 98).