jprosas - Tumblr blog

jprosas · 5 years ago


Regression Modeling in Practice Week 4

In this section we fitted a logistic regression model with “SOCPDLIFE”, “MAJORDEPLIFE” and “numbercigsmoked_c” as the explanatory variables and “NICOTINEDEP” as the response variable:

                  | Lower CI | Upper CI | OR   | p-value
Intercept         | 0.99     | 1.29     | 1.13 | 0.065
SOCPDLIFE         | 1.15     | 4.83     | 2.36 | 0.019
MAJORDEPLIFE      | 2.78     | 5.11     | 3.77 | <0.0001
numbercigsmoked_c | 1.04     | 1.07     | 1.05 | <0.0001

For our primary explanatory variable (MAJORDEPLIFE), OR = 3.77 means that the odds of nicotine dependence for someone with major depression are 3.77 times the odds for someone without it, holding the other variables constant. Keep in mind that, with 95% confidence, this ratio could lie anywhere between 2.78 and 5.11.
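The odds ratio and its interval are simply the exponentiated log-odds coefficient and confidence bounds. A minimal sketch, using the MAJORDEPLIFE values copied by hand from the regression output in this post (the variable names here are only illustrative):

```python
import math

# Coefficient and 95% CI bounds for MAJORDEPLIFE on the log-odds scale,
# copied from the logit summary printed in this post.
coef, ci_low, ci_high = 1.3281, 1.024, 1.632

# Exponentiating moves from the log-odds scale to the odds-ratio scale.
odds_ratio = math.exp(coef)
or_low = math.exp(ci_low)
or_high = math.exp(ci_high)

print(round(odds_ratio, 2), round(or_low, 2), round(or_high, 2))
```

Rounded to two decimals this gives 3.77, 2.78 and 5.11, matching the table.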

The potential confounders (SOCPDLIFE and numbercigsmoked_c) also have p-values below 0.05, so they are statistically significant as well.
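To get a feel for what these odds mean in terms of probabilities, the fitted log-odds can be passed through the logistic function. This is only a sketch: the coefficients are copied from the summary output in this post, and the helper name `predicted_probability` is made up for illustration.

```python
import math

# Log-odds coefficients copied from the logit summary in this post.
COEFS = {
    "Intercept": 0.1233,
    "SOCPDLIFE": 0.8587,
    "MAJORDEPLIFE": 1.3281,
    "numbercigsmoked_c": 0.0511,
}

def predicted_probability(socpdlife, majordeplife, cigs_centered):
    """Predicted probability of nicotine dependence (hypothetical helper)."""
    log_odds = (
        COEFS["Intercept"]
        + COEFS["SOCPDLIFE"] * socpdlife
        + COEFS["MAJORDEPLIFE"] * majordeplife
        + COEFS["numbercigsmoked_c"] * cigs_centered
    )
    # Logistic function: maps log-odds to a probability between 0 and 1.
    return 1.0 / (1.0 + math.exp(-log_odds))

# Smoker with an average number of cigarettes (centered value 0) and no social phobia.
p_without = predicted_probability(0, 0, 0.0)
p_with = predicted_probability(0, 1, 0.0)
print(round(p_without, 3), round(p_with, 3))
```

For an average smoker without social phobia, the predicted probability goes from about 0.53 without major depression to about 0.81 with it.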

INPUTS AND OUTPUTS

import numpy
import statsmodels.formula.api as smf

# center the quantitative explanatory variables
sub1['age_c'] = (sub1['AGE'] - sub1['AGE'].mean())
sub1['numbercigsmoked_c'] = (sub1['numbercigsmoked'] - sub1['numbercigsmoked'].mean())

# logistic regression with social phobia, depression and numbercigsmoked_c
logmod = smf.logit(formula='NICOTINEDEP ~ SOCPDLIFE + MAJORDEPLIFE + numbercigsmoked_c', data=sub1).fit()
print(logmod.summary())

# as we can see, all three variables are significant

# odds ratios with 95% confidence intervals
params = logmod.params
conf = logmod.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print(numpy.exp(conf))

Optimization terminated successfully.
         Current function value: 0.613855
         Iterations 6
                           Logit Regression Results
==============================================================================
Dep. Variable:            NICOTINEDEP   No. Observations:                 1315
Model:                          Logit   Df Residuals:                     1311
Method:                           MLE   Df Model:                            3
Date:                Fri, 13 Nov 2020   Pseudo R-squ.:                 0.08399
Time:                        03:47:55   Log-Likelihood:                -807.22
converged:                       True   LL-Null:                       -881.23
Covariance Type:            nonrobust   LLR p-value:                 7.025e-32
=====================================================================================
                        coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept             0.1233      0.067      1.849      0.065      -0.007       0.254
SOCPDLIFE             0.8587      0.365      2.353      0.019       0.144       1.574
MAJORDEPLIFE          1.3281      0.155      8.560      0.000       1.024       1.632
numbercigsmoked_c     0.0511      0.008      6.305      0.000       0.035       0.067
=====================================================================================
                   Lower CI  Upper CI    OR
Intercept              0.99      1.29  1.13
SOCPDLIFE              1.15      4.83  2.36
MAJORDEPLIFE           2.78      5.11  3.77
numbercigsmoked_c      1.04      1.07  1.05


jprosas · 5 years ago


Regression Modeling in Practice Week 3

In this section we fitted a multiple linear regression model with “DYSLIFE”, “MAJORDEPLIFE”, “numbercigsmoked_c” and “age_c” as the explanatory variables and “NDSymptoms” as the response variable.

We obtained the following model:

                  | ESTIMATE | p-value
Intercept         | 2.1908   | 0.000
DYSLIFE           | 0.2686   | 0.197
MAJORDEPLIFE      | 1.2917   | 0.000
numbercigsmoked_c | 0.0357   | 0.000
age_c             | -0.0410  | 0.062

We can see that age_c (p = 0.062) and DYSLIFE (p = 0.197) do not pass the significance test at the 0.05 level when the other variables are in the model.

Also, the R-squared is 0.136, which is not very good: the model explains only about 13.6% of the variability in NDSymptoms.

But, as we expected, MAJORDEPLIFE has a significant positive linear relationship with NDSymptoms.
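To make the size of the MAJORDEPLIFE effect concrete, the estimates can be plugged into the linear equation by hand. This is a rough sketch: the coefficients are copied from the table in this post, and the helper name `predict_symptoms` is hypothetical.

```python
# Estimates copied from the coefficient table in this post.
COEFS = {
    "Intercept": 2.1908,
    "DYSLIFE": 0.2686,
    "MAJORDEPLIFE": 1.2917,
    "numbercigsmoked_c": 0.0357,
    "age_c": -0.0410,
}

def predict_symptoms(dyslife, majordeplife, cigs_centered, age_centered):
    """Expected number of nicotine dependence symptoms (hypothetical helper)."""
    return (
        COEFS["Intercept"]
        + COEFS["DYSLIFE"] * dyslife
        + COEFS["MAJORDEPLIFE"] * majordeplife
        + COEFS["numbercigsmoked_c"] * cigs_centered
        + COEFS["age_c"] * age_centered
    )

# Person of average age who smokes an average number of cigarettes (centered values are 0).
baseline = predict_symptoms(0, 0, 0.0, 0.0)
with_depression = predict_symptoms(0, 1, 0.0, 0.0)
print(baseline, with_depression)
```

Because the quantitative variables are centered, the intercept is the expected number of symptoms for an average participant without either diagnosis, and major depression adds about 1.29 symptoms on top of that.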

Next we can see some graphs to analyze the residuals of our model:

* Q-Q plot

In this plot we can see that the residuals of our model do not exactly follow a normal distribution.

* Standardized residuals for all observations

Almost all of our standardized residuals have an absolute value below 2.

* Leverage

Here we can see that most of the observations whose standardized residuals exceed 2 in absolute value have very low leverage, and the two most influential points have residuals within 2 in absolute value, so we cannot treat them as outliers.
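The "absolute value of 2" check can also be done numerically rather than by eye. Under a normal distribution, roughly 5% of standardized residuals are expected to fall beyond 2 in absolute value. The sketch below uses a made-up toy list, not the NESARC residuals, and the function name `share_beyond` is hypothetical; with the fitted model one would pass `full_model.resid_pearson` instead.

```python
def share_beyond(std_residuals, threshold=2.0):
    """Fraction of standardized residuals whose absolute value exceeds threshold."""
    values = list(std_residuals)
    if not values:
        raise ValueError("need at least one residual")
    return sum(abs(r) > threshold for r in values) / len(values)

# Toy residuals for illustration only (NOT taken from the model in this post).
toy_residuals = [0.5, -2.3, 1.9, 2.1, -0.1]
print(share_beyond(toy_residuals))
```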

INPUTS

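import pandas
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf

# multiple linear regression
full_model = smf.ols('NDSymptoms ~ DYSLIFE + MAJORDEPLIFE + numbercigsmoked_c + age_c', data=sub1).fit()
print(full_model.summary())

# Q-Q plot of the residuals
fig1 = sm.qqplot(full_model.resid, line='r')

# standardized residuals for all observations
stdres = pandas.DataFrame(full_model.resid_pearson)
fig2 = plt.plot(stdres, 'o', ls='None')
l = plt.axhline(y=0, color='r')
plt.ylabel('Standardized Residual')
plt.xlabel('Observation Number')

# leverage plot
fig3 = sm.graphics.influence_plot(full_model, size=6)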


jprosas · 5 years ago


Regression Modeling in Practice Week 2

In this section we fitted a linear model for the response variable “numbercigsmoked” using only the categorical explanatory variable “NICOTINEDEP”.

The results indicate that a person classified as nicotine dependent smokes, on average, 3.2139 more cigarettes per day than someone who is not classified as nicotine dependent.

The estimate of 3.2139 has a p-value of .0001, so the relationship is statistically significant.
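With a single 0/1 explanatory variable, the OLS slope equals the difference between the two group means, which is why the estimate can be read directly as "more cigarettes per day". A minimal sketch with made-up toy numbers (not the NESARC data; the helper name `ols_slope_intercept` is hypothetical):

```python
def ols_slope_intercept(x, y):
    """Closed-form simple linear regression: returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Toy data for illustration only: 0 = not dependent, 1 = dependent.
dependent = [0, 0, 0, 1, 1, 1]
cigs_per_day = [10, 12, 14, 13, 16, 19]

slope, intercept = ols_slope_intercept(dependent, cigs_per_day)

# Group means computed directly, for comparison with the regression estimates.
mean_0 = sum(c for d, c in zip(dependent, cigs_per_day) if d == 0) / 3
mean_1 = sum(c for d, c in zip(dependent, cigs_per_day) if d == 1) / 3
print(slope, intercept, mean_1 - mean_0)
```

In this toy data the slope (4) is the gap between the group means (16 and 12), and the intercept is the mean of the group coded 0.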

The following frequency table also shows that there are more nicotine dependent participants than non-dependent ones:

NICOTINEDEP
0    517
1    798

Here's the code that was used for this section:

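import pandas
import statsmodels.formula.api as smf

sub1 = sub1[['numbercigsmoked', 'NICOTINEDEP']].dropna()

reg1 = smf.ols('numbercigsmoked ~ NICOTINEDEP', data=sub1).fit()
print(reg1.summary())

# frequency table of the explanatory variable
my_tab = pandas.crosstab(index=sub1['NICOTINEDEP'], columns="count")
print(my_tab)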


jprosas · 5 years ago


Understanding some relations between the use of drugs and mental health

SAMPLE

The sample consists of 43,049 observations, collected through a survey administered individually to adults in the United States.

The survey is the first wave of the National Epidemiologic Survey on Alcohol and Related Conditions (NESARC).

The data analytic sample for this study included participants 18-25 years old who reported smoking at least 1 cigarette per day in the past 30 days (N = 1,320).

PROCEDURE

Data were collected during 2001-2002 by trained U.S. Census Bureau Field Representatives, who interviewed each participant individually as they answered the survey.

MEASURES

To determine whether a person is currently smoking, we used smoking frequency and quantity. This tells us whether or not the person smokes daily and lets us quantify how many cigarettes per day the person smokes (ranging from 1 to 98).
