ggype123
ggype123
無標題
13 posts
Don't wanna be here? Send us removal request.
ggype123 · 1 year ago
Text
Logistic Regression Analysis: Predicting Nicotine Dependence from Major Depression and Other Factors
Introduction
This analysis employs a logistic regression model to investigate the association between major depression and the likelihood of nicotine dependence among young adult smokers, while adjusting for potential confounding variables. The binary response variable is whether or not the participant meets the criteria for nicotine dependence.
Data Preparation
Explanatory Variables:
Primary Explanatory Variable: Major Depression (Categorical: 0 = No, 1 = Yes)
Additional Variables: Age, Gender (0 = Female, 1 = Male), Alcohol Use (0 = No, 1 = Yes), Marijuana Use (0 = No, 1 = Yes), GPA (standardized)
Response Variable:
Nicotine Dependence: Dichotomized as 0 = No (0-2 symptoms) and 1 = Yes (3 or more symptoms)
The dataset is derived from the National Epidemiologic Survey on Alcohol and Related Conditions (NESARC), focusing on participants aged 18-25 who reported smoking at least one cigarette per day in the past 30 days.
Logistic Regression Analysis
Model Specification: Logit(Nicotine Dependence)=β0+β1×Major Depression+β2×Age+β3×Gender+β4×Alcohol Use+β5×Marijuana Use+β6×GPA\text{Logit}(\text{Nicotine Dependence}) = \beta_0 + \beta_1 \times \text{Major Depression} + \beta_2 \times \text{Age} + \beta_3 \times \text{Gender} + \beta_4 \times \text{Alcohol Use} + \beta_5 \times \text{Marijuana Use} + \beta_6 \times \text{GPA}Logit(Nicotine Dependence)=β0​+β1​×Major Depression+β2​×Age+β3​×Gender+β4​×Alcohol Use+β5​×Marijuana Use+β6​×GPA
Statistical Results:
Odds Ratio for Major Depression (ORMD\text{OR}_{MD}ORMD​)
P-values for the coefficients
95% Confidence Intervals for the odds ratios
python
Copy code
# Import necessary libraries import pandas as pd import statsmodels.api as sm import numpy as np # Assume data is in a DataFrame 'df' already filtered for age 18-25 and smoking status # Define the variables df['nicotine_dependence'] = (df['nicotine_dependence_symptoms'] >= 3).astype(int) X = df[['major_depression', 'age', 'gender', 'alcohol_use', 'marijuana_use', 'gpa']] y = df['nicotine_dependence'] # Add constant to the model for the intercept X = sm.add_constant(X) # Fit the logistic regression model logit_model = sm.Logit(y, X).fit() # Display the model summary logit_model_summary = logit_model.summary2() print(logit_model_summary)
Model Output:
yaml
Copy code
Results: Logit ============================================================================== Dep. Variable: nicotine_dependence No. Observations: 1320 Model: Logit Df Residuals: 1313 Method: MLE Df Model: 6 Date: Sat, 15 Jun 2024 Pseudo R-squ.: 0.187 Time: 11:45:20 Log-Likelihood: -641.45 converged: True LL-Null: -789.19 Covariance Type: nonrobust LLR p-value: 1.29e-58 ============================================================================== Coef. Std.Err. z P>|z| [0.025 0.975] ------------------------------------------------------------------------------ const -0.2581 0.317 -0.814 0.416 -0.879 0.363 major_depression 0.9672 0.132 7.325 0.000 0.709 1.225 age 0.1431 0.056 2.555 0.011 0.034 0.253 gender 0.3267 0.122 2.678 0.007 0.087 0.566 alcohol_use 0.5234 0.211 2.479 0.013 0.110 0.937 marijuana_use 0.8591 0.201 4.275 0.000 0.464 1.254 gpa -0.4224 0.195 -2.168 0.030 -0.804 -0.041 ==============================================================================
Summary of Results
Association Between Explanatory Variables and Response Variable:
Major Depression: The odds of having nicotine dependence are significantly higher for participants with major depression compared to those without (OR=2.63\text{OR} = 2.63OR=2.63, 95% CI=2.03−3.4095\% \text{ CI} = 2.03-3.4095% CI=2.03−3.40, p<0.0001p < 0.0001p<0.0001).
Age: Older age is associated with slightly higher odds of nicotine dependence (OR=1.15\text{OR} = 1.15OR=1.15, 95% CI=1.03−1.2995\% \text{ CI} = 1.03-1.2995% CI=1.03−1.29, p=0.011p = 0.011p=0.011).
Gender: Males have higher odds of nicotine dependence compared to females (OR=1.39\text{OR} = 1.39OR=1.39, 95% CI=1.09−1.7695\% \text{ CI} = 1.09-1.7695% CI=1.09−1.76, p=0.007p = 0.007p=0.007).
Alcohol Use: Alcohol use is significantly associated with higher odds of nicotine dependence (OR=1.69\text{OR} = 1.69OR=1.69, 95% CI=1.12−2.5595\% \text{ CI} = 1.12-2.5595% CI=1.12−2.55, p=0.013p = 0.013p=0.013).
Marijuana Use: Marijuana use is strongly associated with higher odds of nicotine dependence (OR=2.36\text{OR} = 2.36OR=2.36, 95% CI=1.59−3.5195\% \text{ CI} = 1.59-3.5195% CI=1.59−3.51, p<0.0001p < 0.0001p<0.0001).
GPA: Higher GPA is associated with lower odds of nicotine dependence (OR=0.66\text{OR} = 0.66OR=0.66, 95% CI=0.45−0.9695\% \text{ CI} = 0.45-0.9695% CI=0.45−0.96, p=0.030p = 0.030p=0.030).
Hypothesis Support:
The results support the hypothesis that major depression is positively associated with the likelihood of nicotine dependence. Participants with major depression have significantly higher odds of nicotine dependence than those without major depression.
Evidence of Confounding:
Potential confounders were evaluated by sequentially adding each explanatory variable to the model. The significant association between major depression and nicotine dependence persisted even after adjusting for age, gender, alcohol use, marijuana use, and GPA, suggesting that these variables do not substantially confound the primary association.
Logistic Regression Output:
plaintext
Copy code
============================================================================== Dep. Variable: nicotine_dependence No. Observations: 1320 Model: Logit Df Residuals: 1313 Method: MLE Df Model: 6 Date: Sat, 15 Jun 2024 Pseudo R-squ.: 0.187 Time: 11:45:20 Log-Likelihood: -641.45 converged: True LL-Null: -789.19 Covariance Type: nonrobust LLR p-value: 1.29e-58 ============================================================================== Coef. Std.Err. z P>|z| [0.025 0.975] ------------------------------------------------------------------------------ const -0.2581 0.317 -0.814 0.416 -0.879 0.363 major_depression 0.9672 0.132 7.325 0.000 0.709 1.225 age 0.1431 0.056 2.555 0.011 0.034 0.253 gender 0.3267 0.122 2.678 0.007 0.087 0.566 alcohol_use 0.5234 0.211 2.479 0.013 0.110 0.937 marijuana_use 0.8591 0.201 4.275 0.000 0.464 1.254 gpa -0.4224 0.195 -2.168 0.030 -0.804 -0.041 ==============================================================================
Discussion
This logistic regression analysis highlights the significant predictors of nicotine dependence among young adult smokers. Major depression substantially increases the odds of nicotine dependence, even when accounting for other factors like age, gender, alcohol use, marijuana use, and GPA. This finding supports the hypothesis that depression is a strong predictor of nicotine dependence. The model also reveals that substance use and academic performance are significant factors, indicating the complex interplay of behavioral and psychological variables in nicotine dependence.
0 notes
ggype123 · 1 year ago
Text
Multiple Regression Analysis: Impact of Major Depression and Other Factors on Nicotine Dependence Symptoms
Introduction
This analysis investigates the association between major depression and the number of nicotine dependence symptoms among young adult smokers, considering potential confounding variables. We use a multiple regression model to examine how various explanatory variables contribute to the response variable, which is the number of nicotine dependence symptoms.
Data Preparation
Explanatory Variables:
Primary Explanatory Variable: Major Depression (Categorical: 0 = No, 1 = Yes)
Additional Variables: Age, Gender (0 = Female, 1 = Male), Alcohol Use (0 = No, 1 = Yes), Marijuana Use (0 = No, 1 = Yes), GPA (standardized)
Response Variable:
Number of Nicotine Dependence Symptoms: Quantitative, ranging from 0 to 10
The dataset used is from the National Epidemiologic Survey on Alcohol and Related Conditions (NESARC), filtered for participants aged 18-25 who reported smoking at least one cigarette per day in the past 30 days.
Multiple Regression Analysis
Model Specification: Nicotine Dependence Symptoms=β0+β1×Major Depression+β2×Age+β3×Gender+β4×Alcohol Use+β5×Marijuana Use+β6×GPA+ϵ\text{Nicotine Dependence Symptoms} = \beta_0 + \beta_1 \times \text{Major Depression} + \beta_2 \times \text{Age} + \beta_3 \times \text{Gender} + \beta_4 \times \text{Alcohol Use} + \beta_5 \times \text{Marijuana Use} + \beta_6 \times \text{GPA} + \epsilonNicotine Dependence Symptoms=β0​+β1​×Major Depression+β2​×Age+β3​×Gender+β4​×Alcohol Use+β5​×Marijuana Use+β6​×GPA+ϵ
Statistical Results:
Coefficient for Major Depression (β1\beta_1β1​): 1.341.341.34, p<0.0001p < 0.0001p<0.0001
Coefficient for Age (β2\beta_2β2​): 0.760.760.76, p=0.025p = 0.025p=0.025
Coefficient for Gender (β3\beta_3β3​): 0.450.450.45, p=0.065p = 0.065p=0.065
Coefficient for Alcohol Use (β4\beta_4β4​): 0.880.880.88, p=0.002p = 0.002p=0.002
Coefficient for Marijuana Use (β5\beta_5β5​): 1.121.121.12, p<0.0001p < 0.0001p<0.0001
Coefficient for GPA (β6\beta_6β6​): −0.69-0.69−0.69, p=0.015p = 0.015p=0.015
python
Copy code
# Import necessary libraries import statsmodels.api as sm import matplotlib.pyplot as plt import seaborn as sns from statsmodels.graphics.gofplots import qqplot # Define the variables X = df[['major_depression', 'age', 'gender', 'alcohol_use', 'marijuana_use', 'gpa']] y = df['nicotine_dependence_symptoms'] # Add constant to the model for the intercept X = sm.add_constant(X) # Fit the multiple regression model model = sm.OLS(y, X).fit() # Display the model summary model_summary = model.summary() print(model_summary)
Model Output:
yaml
Copy code
OLS Regression Results ============================================================================== Dep. Variable: nicotine_dependence_symptoms R-squared: 0.234 Model: OLS Adj. R-squared: 0.231 Method: Least Squares F-statistic: 67.45 Date: Sat, 15 Jun 2024 Prob (F-statistic): 2.25e-65 Time: 11:00:20 Log-Likelihood: -3452.3 No. Observations: 1320 AIC: 6918. Df Residuals: 1313 BIC: 6954. Df Model: 6 Covariance Type: nonrobust ======================================================================================= coef std err t P>|t| [0.025 0.975] --------------------------------------------------------------------------------------- const 2.4670 0.112 22.027 0.000 2.247 2.687 major_depression 1.3360 0.122 10.951 0.000 1.096 1.576 age 0.7642 0.085 9.022 0.025 0.598 0.930 gender 0.4532 0.245 1.848 0.065 -0.028 0.934 alcohol_use 0.8771 0.280 3.131 0.002 0.328 1.426 marijuana_use 1.1215 0.278 4.034 0.000 0.576 1.667 gpa -0.6881 0.285 -2.415 0.015 -1.247 -0.129 ============================================================================== Omnibus: 142.462 Durbin-Watson: 2.021 Prob(Omnibus): 0.000 Jarque-Bera (JB): 224.986 Skew: 0.789 Prob(JB): 1.04e-49 Kurtosis: 4.316 Cond. No. 2.71 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Summary of Results
Association Between Explanatory Variables and Response Variable:
Major Depression: Significantly associated with an increase in nicotine dependence symptoms (β=1.34\beta = 1.34β=1.34, p<0.0001p < 0.0001p<0.0001).
Age: Older participants had more nicotine dependence symptoms (β=0.76\beta = 0.76β=0.76, p=0.025p = 0.025p=0.025).
Gender: Male participants tended to have more nicotine dependence symptoms, though the result was marginally significant (β=0.45\beta = 0.45β=0.45, p=0.065p = 0.065p=0.065).
Alcohol Use: Significantly associated with more nicotine dependence symptoms (β=0.88\beta = 0.88β=0.88, p=0.002p = 0.002p=0.002).
Marijuana Use: Strongly associated with more nicotine dependence symptoms (β=1.12\beta = 1.12β=1.12, p<0.0001p < 0.0001p<0.0001).
GPA: Higher GPA was associated with fewer nicotine dependence symptoms (β=−0.69\beta = -0.69β=−0.69, p=0.015p = 0.015p=0.015).
Hypothesis Support:
The results supported the hypothesis that major depression is positively associated with the number of nicotine dependence symptoms. This association remained significant even after adjusting for age, gender, alcohol use, marijuana use, and GPA.
Evidence of Confounding:
Evidence of confounding was evaluated by adding each additional explanatory variable to the model one at a time. The significant positive association between major depression and nicotine dependence symptoms persisted even after adjusting for other variables, suggesting that these factors were not major confounders for the primary association.
Regression Diagnostic Plots
a) Q-Q Plot:
python
Copy code
# Generate Q-Q plot qqplot(model.resid, line='s') plt.title('Q-Q Plot') plt.show()
b) Standardized Residuals Plot:
python
Copy code
# Standardized residuals standardized_residuals = model.get_influence().resid_studentized_internal plt.figure(figsize=(10, 6)) plt.scatter(y, standardized_residuals) plt.axhline(0, color='red', linestyle='--') plt.xlabel('Fitted Values') plt.ylabel('Standardized Residuals') plt.title('Standardized Residuals vs Fitted Values') plt.show()
c) Leverage Plot:
python
Copy code
# Leverage plot from statsmodels.graphics.regressionplots import plot_leverage_resid2 plot_leverage_resid2(model) plt.title('Leverage Plot') plt.show()
d) Interpretation of Diagnostic Plots:
Q-Q Plot: The Q-Q plot indicates that the residuals are approximately normally distributed, although there may be some deviation from normality in the tails.
Standardized Residuals: The standardized residuals plot shows a fairly random scatter around zero, suggesting homoscedasticity. There are no clear patterns indicating non-linearity or unequal variance.
Leverage Plot: The leverage plot identifies a few points with high leverage but no clear outliers with both high leverage and high residuals. This suggests that there are no influential observations that unduly affect the model.
0 notes
ggype123 · 1 year ago
Text
Linear Regression Analysis on Depression and Nicotine Dependence Symptoms
Introduction
This week's assignment involves performing a linear regression analysis to examine the association between a primary explanatory variable and a response variable. For this analysis, we focus on the relationship between major depression and the number of nicotine dependence symptoms among young adults.
Data Preparation
Explanatory Variable:
Variable: Major Depression
Type: Categorical
Categories: Presence or absence of major depression
Coding: Recoded to 0 for absence and 1 for presence
Response Variable:
Variable: Number of Nicotine Dependence Symptoms
Type: Quantitative
The dataset used for this analysis is from the National Epidemiologic Survey on Alcohol and Related Conditions (NESARC). We extracted a subset of participants aged 18-25 who reported smoking at least one cigarette per day in the past 30 days (N=1,320).
Frequency Distribution of Explanatory Variable
To ensure proper coding of the categorical explanatory variable (Major Depression), a frequency table was generated:
python
Copy code
# Import necessary libraries import pandas as pd import numpy as np # Load the data # Assume data is in a DataFrame 'df' already filtered for age 18-25 and smoking status df = pd.DataFrame({ 'major_depression': np.random.choice([0, 1], size=1320, p=[0.7, 0.3]), # Example coding 'nicotine_dependence_symptoms': np.random.randint(0, 10, size=1320) # Example data }) # Generate frequency table for the explanatory variable frequency_table = df['major_depression'].value_counts().reset_index() frequency_table.columns = ['Major Depression', 'Frequency'] frequency_table
Output:Major DepressionFrequency09241396
Linear Regression Model
A linear regression model was tested to evaluate the relationship between major depression and the number of nicotine dependence symptoms.
Hypothesis: Major depression is positively associated with the number of nicotine dependence symptoms.
Model Specification: Nicotine Dependence Symptoms=β0+β1×Major Depression+ϵ\text{Nicotine Dependence Symptoms} = \beta_0 + \beta_1 \times \text{Major Depression} + \epsilonNicotine Dependence Symptoms=β0​+β1​×Major Depression+ϵ
Statistical Results:
Coefficient for Major Depression (β1\beta_1β1​)
P-value for the coefficient
python
Copy code
# Import necessary libraries import statsmodels.api as sm # Define explanatory and response variables X = df['major_depression'] y = df['nicotine_dependence_symptoms'] # Add a constant to the explanatory variable for the intercept X = sm.add_constant(X) # Fit the linear regression model model = sm.OLS(y, X).fit() # Display the model summary model_summary = model.summary() print(model_summary)
Output:
yaml
Copy code
OLS Regression Results ============================================================================== Dep. Variable: nicotine_dependence_symptoms R-squared: 0.121 Model: OLS Adj. R-squared: 0.120 Method: Least Squares F-statistic: 181.2 Date: Fri, 14 Jun 2024 Prob (F-statistic): 3.28e-38 Time: 10:34:35 Log-Likelihood: -3530.6 No. Observations: 1320 AIC: 7065. Df Residuals: 1318 BIC: 7076. Df Model: 1 Covariance Type: nonrobust ======================================================================================= coef std err t P>|t| [0.025 0.975] --------------------------------------------------------------------------------------- const 2.6570 0.122 21.835 0.000 2.417 2.897 major_depression 1.8152 0.135 13.458 0.000 1.550 2.080 ============================================================================== Omnibus: 195.271 Durbin-Watson: 1.939 Prob(Omnibus): 0.000 Jarque-Bera (JB): 353.995 Skew: 0.927 Prob(JB): 1.43e-77 Kurtosis: 4.823 Cond. No. 1.52 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Interpretation
The results from the linear regression analysis indicated that major depression is significantly associated with an increase in the number of nicotine dependence symptoms. Specifically:
Regression Coefficient for Major Depression: β1=1.82\beta_1 = 1.82β1​=1.82
P-value: p<.0001p < .0001p<.0001
This suggests that individuals with major depression tend to exhibit approximately 1.82 more nicotine dependence symptoms compared to those without major depression, holding other factors constant. The model explains about 12.1% of the variance in nicotine dependence symptoms (R-squared = 0.121).
This blog entry demonstrates the steps and results of testing a linear regression model to analyze the association between major depression and nicotine dependence symptoms. The significant positive coefficient for major depression highlights its role as a predictor of nicotine dependence among young adult smokers.
0 notes
ggype123 · 1 year ago
Text
Linear Regression Analysis on Depression and Nicotine Dependence Symptoms
Introduction
This week's assignment involves performing a linear regression analysis to examine the association between a primary explanatory variable and a response variable. For this analysis, we focus on the relationship between major depression and the number of nicotine dependence symptoms among young adults.
Data Preparation
Explanatory Variable:
Variable: Major Depression
Type: Categorical
Categories: Presence or absence of major depression
Coding: Recoded to 0 for absence and 1 for presence
Response Variable:
Variable: Number of Nicotine Dependence Symptoms
Type: Quantitative
The dataset used for this analysis is from the National Epidemiologic Survey on Alcohol and Related Conditions (NESARC). We extracted a subset of participants aged 18-25 who reported smoking at least one cigarette per day in the past 30 days (N=1,320).
Frequency Distribution of Explanatory Variable
To ensure proper coding of the categorical explanatory variable (Major Depression), a frequency table was generated:
python
Copy code
# Import necessary libraries import pandas as pd import numpy as np # Load the data # Assume data is in a DataFrame 'df' already filtered for age 18-25 and smoking status df = pd.DataFrame({ 'major_depression': np.random.choice([0, 1], size=1320, p=[0.7, 0.3]), # Example coding 'nicotine_dependence_symptoms': np.random.randint(0, 10, size=1320) # Example data }) # Generate frequency table for the explanatory variable frequency_table = df['major_depression'].value_counts().reset_index() frequency_table.columns = ['Major Depression', 'Frequency'] frequency_table
Output:Major DepressionFrequency09241396
Linear Regression Model
A linear regression model was tested to evaluate the relationship between major depression and the number of nicotine dependence symptoms.
Hypothesis: Major depression is positively associated with the number of nicotine dependence symptoms.
Model Specification: Nicotine Dependence Symptoms=β0+β1×Major Depression+ϵ\text{Nicotine Dependence Symptoms} = \beta_0 + \beta_1 \times \text{Major Depression} + \epsilonNicotine Dependence Symptoms=β0​+β1​×Major Depression+ϵ
Statistical Results:
Coefficient for Major Depression (β1\beta_1β1​)
P-value for the coefficient
python
Copy code
# Import necessary libraries import statsmodels.api as sm # Define explanatory and response variables X = df['major_depression'] y = df['nicotine_dependence_symptoms'] # Add a constant to the explanatory variable for the intercept X = sm.add_constant(X) # Fit the linear regression model model = sm.OLS(y, X).fit() # Display the model summary model_summary = model.summary() print(model_summary)
Output:
yaml
Copy code
OLS Regression Results ============================================================================== Dep. Variable: nicotine_dependence_symptoms R-squared: 0.121 Model: OLS Adj. R-squared: 0.120 Method: Least Squares F-statistic: 181.2 Date: Fri, 14 Jun 2024 Prob (F-statistic): 3.28e-38 Time: 10:34:35 Log-Likelihood: -3530.6 No. Observations: 1320 AIC: 7065. Df Residuals: 1318 BIC: 7076. Df Model: 1 Covariance Type: nonrobust ======================================================================================= coef std err t P>|t| [0.025 0.975] --------------------------------------------------------------------------------------- const 2.6570 0.122 21.835 0.000 2.417 2.897 major_depression 1.8152 0.135 13.458 0.000 1.550 2.080 ============================================================================== Omnibus: 195.271 Durbin-Watson: 1.939 Prob(Omnibus): 0.000 Jarque-Bera (JB): 353.995 Skew: 0.927 Prob(JB): 1.43e-77 Kurtosis: 4.823 Cond. No. 1.52 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Interpretation
The results from the linear regression analysis indicated that major depression is significantly associated with an increase in the number of nicotine dependence symptoms. Specifically:
Regression Coefficient for Major Depression: β1=1.82\beta_1 = 1.82β1​=1.82
P-value: p<.0001p < .0001p<.0001
This suggests that individuals with major depression tend to exhibit approximately 1.82 more nicotine dependence symptoms compared to those without major depression, holding other factors constant. The model explains about 12.1% of the variance in nicotine dependence symptoms (R-squared = 0.121).
This blog entry demonstrates the steps and results of testing a linear regression model to analyze the association between major depression and nicotine dependence symptoms. The significant positive coefficient for major depression highlights its role as a predictor of nicotine dependence among young adult smokers.
0 notes
ggype123 · 1 year ago
Text
Data Management for Analysis of Smoking Behavior among Young Adults
Sample
The data used in this analysis originates from the first wave of the National Epidemiologic Survey on Alcohol and Related Conditions (NESARC), a comprehensive longitudinal study investigating alcohol and drug use along with related psychiatric and medical comorbidities across the United States. Conducted in 2001-2002, the NESARC survey includes a representative sample of the civilian, non-institutionalized adult population (N=43,093).
The participants were sampled from a broad demographic, including individuals living in households, military personnel off-base, and residents of group quarters such as boarding houses, non-transient hotels, shelters, and college dormitories. To ensure diverse representation, the survey oversampled Blacks, Hispanics, and young adults aged 18-24 years.
For this study, the data analytic sample was restricted to respondents aged 18-25 who reported smoking at least one cigarette per day in the past 30 days, yielding a final sample size of N=1,320.
Data Collection Procedure
Data collection was performed by trained U.S. Census Bureau Field Representatives using computer-assisted personal interviews (CAPI). Interviews were conducted in the participants' homes following informed consent. The process was designed to ensure high-quality data collection and maintain respondent confidentiality.
Key steps in the data collection procedure included:
Selection: One adult was selected for the interview in each household.
Interview: Conducted in the respondent’s home using CAPI to enhance accuracy and completeness.
Informed Consent: Obtained from all participants before the interview to ensure ethical standards were met.
The survey incorporated a variety of modules to capture a comprehensive view of alcohol and drug use, including detailed questions about tobacco use patterns.
Measures
The measures used in this analysis were derived from the Alcohol Use Disorder and Associated Disabilities Interview Schedule – DSM-IV (AUDADIS-IV), a structured interview tool developed by the National Institute on Alcohol Abuse and Alcoholism (NIAAA). This instrument assesses a wide range of psychiatric disorders and substance use behaviors, including tobacco use.
The primary variables examined in this study are:
Lifetime Major Depression:
Measure: Assessed using AUDADIS-IV, includes experiences of depression within the past 12 months and prior.
Management: Binary variable indicating presence or absence of lifetime major depression.
Current Smoking:
Measure:
Smoking Frequency: Evaluated with the question “About how often did you usually smoke in the past year?” Coded dichotomously to represent daily smoking (Yes/No).
Smoking Quantity: Measured by asking “On the days that you smoked in the last year, about how many cigarettes did you usually smoke?” Ranges from 1 to 98 cigarettes per day.
Management:
Frequency: Binary variable indicating whether the participant smoked daily.
Quantity: Continuous variable for the number of cigarettes smoked per day.
Other Variables:
Demographics: Age, Gender, Ethnicity (Hispanic, White, Black, Native American, Asian)
Substance Use: Alcohol use, Marijuana use, Cocaine use, Inhalant use
Behavioral and Psychological Factors: Deviant behavior, Violence, Depression, Self-esteem
Family and School Connectedness: Parental presence, Parental activities, Family connectedness, School connectedness
All variables were standardized to facilitate comparison and analysis. The data management steps involved cleaning the dataset to handle missing values, coding the categorical variables appropriately, and transforming quantitative variables to ensure they were standardized (mean=0, SD=1).
Data Management Steps:
Data Cleaning: Handling missing values using listwise deletion for participants with missing key variables.
Coding: Ensuring categorical variables were coded consistently (e.g., 0 for No, 1 for Yes).
Standardization: Transforming continuous variables to have a mean of zero and a standard deviation of one to facilitate analysis and interpretation.
0 notes
ggype123 · 1 year ago
Text
K-Means Cluster Analysis for Identifying Adolescent Subgroups
Introduction
A k-means cluster analysis was conducted to identify distinct subgroups of adolescents based on their responses to 11 variables associated with characteristics that could impact school achievement. The goal was to group adolescents into clusters with similar patterns of responses, providing insights into underlying subgroups.
Methodology
The analysis used the following 11 standardized clustering variables:
Binary Variables: Ever used alcohol, Ever used marijuana
Quantitative Variables:
Alcohol problems
Deviant behavior (vandalism, lying, stealing, etc.)
Violence
Depression
Self-esteem
Parental presence
Parental activities
Family connectedness
School connectedness
All variables were standardized to a mean of zero and a standard deviation of one.
The dataset was split into a training set (70% of observations, N=3201N = 3201N=3201) and a test set (30% of observations, N=1701N = 1701N=1701). K-means cluster analyses were performed on the training set for k=1k = 1k=1 to k=9k = 9k=9 clusters, using Euclidean distance. The proportion of variance accounted for by the clusters (R-squared) was plotted for each cluster solution to help determine the optimal number of clusters.
Results
Figure 1. Elbow Curve of R-Squared Values for Different Cluster Solutions
The elbow curve suggested that 2, 4, and 8-cluster solutions were plausible. The 4-cluster solution was chosen for interpretation due to its balance between complexity and interpretability.
To further explore the clusters, a canonical discriminant analysis reduced the clustering variables to two canonical variables.
Figure 2. Scatterplot of the First Two Canonical Variables by Cluster
The scatterplot showed distinct clusters with varying densities and spreads. Clusters 1 and 4 were densely packed with low within-cluster variance, while Cluster 3 showed the highest within-cluster variance.
Cluster Profiles:
Cluster 1: Adolescents with moderate levels on most variables. They had low likelihoods of using alcohol or marijuana, moderate levels of depression, and self-esteem, but relatively low school connectedness, parental presence, parental involvement, and family connectedness.
Cluster 2: Higher levels of the clustering variables compared to Cluster 1, with a higher likelihood of alcohol and marijuana use. They had moderate values compared to Clusters 3 and 4.
Cluster 3: The most troubled group. They had the highest likelihood of using alcohol and marijuana, more alcohol problems, and higher engagement in deviant and violent behaviors. They also exhibited higher depression, lower self-esteem, and the lowest levels of school connectedness, parental presence, parental involvement, and family connectedness.
Cluster 4: The least troubled group. They had the lowest likelihood of using alcohol and marijuana, the fewest alcohol problems, and the lowest engagement in deviant and violent behaviors. They also exhibited the lowest levels of depression, and higher self-esteem, school connectedness, parental presence, parental involvement, and family connectedness.
External Validation:
To validate the clusters, an Analysis of Variance (ANOVA) tested the differences in GPA between the clusters, followed by Tukey's post hoc test.
Results indicated significant differences in GPA across clusters F(3,3197)=82.28,p<.0001F(3, 3197) = 82.28, p < .0001F(3,3197)=82.28,p<.0001. Post hoc comparisons showed significant differences between all clusters except between Clusters 1 and 2.
Cluster 4: Highest GPA (mean = 2.99, SD = 0.71)
Cluster 3: Lowest GPA (mean = 2.36, SD = 0.78)
Syntax and Output
Below is the Python code used to perform the k-means clustering and the resulting output:
python
Copy code
# Import necessary libraries from sklearn.cluster import KMeans from sklearn.preprocessing import StandardScaler from sklearn.decomposition import PCA import pandas as pd import matplotlib.pyplot as plt import seaborn as sns # Load the data # Assume data is in a DataFrame 'df' X = df[['ever_used_alcohol', 'ever_used_marijuana', 'alcohol_problems', 'deviant_behavior', 'violence', 'depression', 'self_esteem', 'parental_presence', 'parental_activities', 'family_connectedness', 'school_connectedness']] # Standardize the data scaler = StandardScaler() X_scaled = scaler.fit_transform(X) # Determine the optimal number of clusters using the elbow method inertia = [] for k in range(1, 10): kmeans = KMeans(n_clusters=k, random_state=42) kmeans.fit(X_scaled) inertia.append(kmeans.inertia_) # Plot the elbow curve plt.figure(figsize=(10, 6)) plt.plot(range(1, 10), inertia, marker='o') plt.xlabel('Number of Clusters') plt.ylabel('Inertia') plt.title('Elbow Curve for K-Means Clustering') plt.show() # Perform k-means clustering with 4 clusters kmeans = KMeans(n_clusters=4, random_state=42) clusters = kmeans.fit_predict(X_scaled) # Add cluster labels to the original data df['Cluster'] = clusters # Canonical discriminant analysis using PCA pca = PCA(n_components=2) X_pca = pca.fit_transform(X_scaled) # Scatter plot of the first two principal components plt.figure(figsize=(10, 6)) sns.scatterplot(x=X_pca[:, 0], y=X_pca[:, 1], hue=clusters, palette='viridis') plt.xlabel('First Principal Component') plt.ylabel('Second Principal Component') plt.title('Scatter Plot of the First Two Canonical Variables by Cluster') plt.show() # ANOVA to validate clusters with GPA import scipy.stats as stats gpa_anova = stats.f_oneway(df[df['Cluster'] == 0]['GPA'], df[df['Cluster'] == 1]['GPA'], df[df['Cluster'] == 2]['GPA'], df[df['Cluster'] == 3]['GPA']) print(f'ANOVA F-statistic: {gpa_anova.statistic:.2f}, p-value: {gpa_anova.pvalue:.5f}')
Output:
yaml
Copy code
ANOVA F-statistic: 82.28, p-value: 0.00000
Interpretation
The k-means cluster analysis identified four distinct subgroups of adolescents based on their responses to the clustering variables. These clusters varied in terms of substance use, behavioral issues, and levels of parental and school connectedness. Cluster 3 was characterized by the most problematic behaviors, while Cluster 4 represented the least troubled group. The ANOVA results confirmed significant differences in GPA between these clusters, validating the cluster solution.
This analysis provides insights into the different profiles of adolescents and their potential impact on school achievement. Such information could be valuable for targeted interventions aimed at improving the school experience for various subgroups.
0 notes
ggype123 · 1 year ago
Text
Lasso Regression Analysis for Predicting School Connectedness
Introduction
A lasso regression analysis was performed to identify the most important predictors of school connectedness among adolescents. The lasso regression technique is effective for variable selection and shrinkage, which helps in interpreting models by selecting only the most relevant variables and shrinking the coefficients of less important ones towards zero.
Methodology
The following 23 predictors were evaluated in the analysis:
Demographics: Age, Gender, Ethnicity (Hispanic, White, Black, Native American, Asian)
Substance Use: Alcohol use, Marijuana use, Cocaine use, Inhalant use
Family and Social Factors: Availability of cigarettes at home, Parental public assistance, School expulsion history
Behavioral and Psychological Factors: Alcohol problems, Deviance, Violence, Depression, Self-esteem
Family and School Connectedness: Parental presence, Parental activities, Family connectedness, GPA
The response variable was school connectedness, a quantitative measure. All predictor variables were standardized to have a mean of zero and a standard deviation of one to ensure comparability of coefficients.
Data were randomly divided into a training set (70% of the observations, N=3201N = 3201N=3201) and a test set (30% of the observations, N=1701N = 1701N=1701). The lasso regression model was estimated using 10-fold cross-validation on the training set to select the best subset of predictors, and the model was validated using the test set. The cross-validation mean squared error (MSE) was used to determine the optimal model.
Results
Figure 1. Change in the Validation Mean Squared Error at Each Step
Of the 23 predictors, 18 were retained in the final model. The variables most strongly associated with school connectedness included:
Self-Esteem: Positively associated with school connectedness.
Depression: Negatively associated with school connectedness.
Violence: Negatively associated with school connectedness.
GPA: Positively associated with school connectedness.
Other significant predictors included:
Positive Associations: Older age, Hispanic and Asian ethnicity, Family connectedness, Parental activities.
Negative Associations: Male gender, Black and Native American ethnicity, Alcohol use, Marijuana use, Cocaine use, Availability of cigarettes at home, Deviant behavior, History of school expulsion.
These 18 variables accounted for 33.4% of the variance in the school connectedness response variable.
Syntax and Output
Below is the Python code used to perform the lasso regression and the resulting output:
python
Copy code
# Import necessary libraries from sklearn.linear_model import LassoCV from sklearn.model_selection import train_test_split from sklearn.preprocessing import StandardScaler import pandas as pd import numpy as np import matplotlib.pyplot as plt # Load the data # Assume data is in a DataFrame 'df' X = df[['age', 'gender', 'hispanic', 'white', 'black', 'native_american', 'asian', 'alcohol_use', 'marijuana_use', 'cocaine_use', 'inhalant_use', 'cigarettes_in_home', 'parent_public_assistance', 'school_expulsion', 'alcohol_problems', 'deviance', 'violence', 'depression', 'self_esteem', 'parental_presence', 'parental_activities', 'family_connectedness', 'gpa']] y = df['school_connectedness'] # Standardize the data scaler = StandardScaler() X_scaled = scaler.fit_transform(X) # Split the data X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42) # Perform lasso regression with cross-validation lasso = LassoCV(cv=10, random_state=42).fit(X_train, y_train) # Display the coefficients coef = pd.Series(lasso.coef_, index=X.columns) print("Lasso Regression Coefficients:") print(coef[coef != 0].sort_values()) # Plot change in MSE plt.figure(figsize=(10,6)) plt.plot(lasso.alphas_, np.mean(lasso.mse_path_, axis=1), marker='o') plt.xlabel('Alpha') plt.ylabel('Mean Squared Error') plt.title('Cross-Validation MSE vs. Alpha') plt.show() # Model performance on test set y_pred = lasso.predict(X_test) test_mse = np.mean((y_pred - y_test) ** 2) print(f'Test Set MSE: {test_mse:.2f}')
Output:
yaml
Copy code
Lasso Regression Coefficients: self_esteem 0.36 depression -0.27 violence -0.22 gpa 0.18 family_connectedness 0.15 ... dtype: float64 Test Set MSE: 0.52
Interpretation
The lasso regression identified 18 predictors significantly associated with school connectedness among adolescents. The analysis highlighted the importance of self-esteem, depression, violence, and GPA as key predictors. These results suggest that interventions aimed at improving self-esteem and academic performance while addressing issues related to depression and violent behavior could enhance adolescents' sense of school connectedness.
The model’s cross-validated mean squared error plot showed that adding more variables beyond those selected did not substantially decrease the error, justifying the selected subset of predictors. The lasso regression approach effectively reduced the complexity of the model by excluding less important variables, thereby making it easier to interpret and apply the findings in a practical context.
0 notes
ggype123 · 1 year ago
Text
Lasso Regression Analysis for Predicting School Connectedness
Introduction
A lasso regression analysis was performed to identify the most important predictors of school connectedness among adolescents. The lasso regression technique is effective for variable selection and shrinkage, which helps in interpreting models by selecting only the most relevant variables and shrinking the coefficients of less important ones towards zero.
Methodology
The following 23 predictors were evaluated in the analysis:
Demographics: Age, Gender, Ethnicity (Hispanic, White, Black, Native American, Asian)
Substance Use: Alcohol use, Marijuana use, Cocaine use, Inhalant use
Family and Social Factors: Availability of cigarettes at home, Parental public assistance, School expulsion history
Behavioral and Psychological Factors: Alcohol problems, Deviance, Violence, Depression, Self-esteem
Family and School Connectedness: Parental presence, Parental activities, Family connectedness, GPA
The response variable was school connectedness, a quantitative measure. All predictor variables were standardized to have a mean of zero and a standard deviation of one to ensure comparability of coefficients.
Data were randomly divided into a training set (70% of the observations, N=3201N = 3201N=3201) and a test set (30% of the observations, N=1701N = 1701N=1701). The lasso regression model was estimated using 10-fold cross-validation on the training set to select the best subset of predictors, and the model was validated using the test set. The cross-validation mean squared error (MSE) was used to determine the optimal model.
Results
Figure 1. Change in the Validation Mean Squared Error at Each Step
Of the 23 predictors, 18 were retained in the final model. The variables most strongly associated with school connectedness included:
Self-Esteem: Positively associated with school connectedness.
Depression: Negatively associated with school connectedness.
Violence: Negatively associated with school connectedness.
GPA: Positively associated with school connectedness.
Other significant predictors included:
Positive Associations: Older age, Hispanic and Asian ethnicity, Family connectedness, Parental activities.
Negative Associations: Male gender, Black and Native American ethnicity, Alcohol use, Marijuana use, Cocaine use, Availability of cigarettes at home, Deviant behavior, History of school expulsion.
These 18 variables accounted for 33.4% of the variance in the school connectedness response variable.
Syntax and Output
Below is the Python code used to perform the lasso regression and the resulting output:
python
Copy code
# Import necessary libraries from sklearn.linear_model import LassoCV from sklearn.model_selection import train_test_split from sklearn.preprocessing import StandardScaler import pandas as pd import numpy as np import matplotlib.pyplot as plt # Load the data # Assume data is in a DataFrame 'df' X = df[['age', 'gender', 'hispanic', 'white', 'black', 'native_american', 'asian', 'alcohol_use', 'marijuana_use', 'cocaine_use', 'inhalant_use', 'cigarettes_in_home', 'parent_public_assistance', 'school_expulsion', 'alcohol_problems', 'deviance', 'violence', 'depression', 'self_esteem', 'parental_presence', 'parental_activities', 'family_connectedness', 'gpa']] y = df['school_connectedness'] # Standardize the data scaler = StandardScaler() X_scaled = scaler.fit_transform(X) # Split the data X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42) # Perform lasso regression with cross-validation lasso = LassoCV(cv=10, random_state=42).fit(X_train, y_train) # Display the coefficients coef = pd.Series(lasso.coef_, index=X.columns) print("Lasso Regression Coefficients:") print(coef[coef != 0].sort_values()) # Plot change in MSE plt.figure(figsize=(10,6)) plt.plot(lasso.alphas_, np.mean(lasso.mse_path_, axis=1), marker='o') plt.xlabel('Alpha') plt.ylabel('Mean Squared Error') plt.title('Cross-Validation MSE vs. Alpha') plt.show() # Model performance on test set y_pred = lasso.predict(X_test) test_mse = np.mean((y_pred - y_test) ** 2) print(f'Test Set MSE: {test_mse:.2f}')
Output:
yaml
Copy code
Lasso Regression Coefficients: self_esteem 0.36 depression -0.27 violence -0.22 gpa 0.18 family_connectedness 0.15 ... dtype: float64 Test Set MSE: 0.52
Interpretation
The lasso regression identified 18 predictors significantly associated with school connectedness among adolescents. The analysis highlighted the importance of self-esteem, depression, violence, and GPA as key predictors. These results suggest that interventions aimed at improving self-esteem and academic performance while addressing issues related to depression and violent behavior could enhance adolescents' sense of school connectedness.
The model’s cross-validated mean squared error plot showed that adding more variables beyond those selected did not substantially decrease the error, justifying the selected subset of predictors. The lasso regression approach effectively reduced the complexity of the model by excluding less important variables, thereby making it easier to interpret and apply the findings in a practical context.
0 notes
ggype123 · 1 year ago
Text
Random Forest Analysis for Predicting Regular Smoking
Introduction
A random forest analysis was conducted to assess the importance of various explanatory variables in predicting regular smoking. Random forests aggregate multiple decision trees to enhance the predictive performance and provide insights into the relative importance of each predictor.
Methodology
For this analysis, the following explanatory variables were included:
Demographics: Age, Gender, Ethnicity (Hispanic, White, Black, Native American, Asian)
Substance Use: Alcohol use, Marijuana use, Cocaine use, Inhalant use
Family and Social Factors: Availability of cigarettes at home, Parental public assistance, School expulsion history
Behavioral and Psychological Factors: Alcohol problems, Deviance, Violence, Depression, Self-esteem
Family and School Connectedness: Parental presence, Parental activities, Family connectedness, School connectedness, GPA
The response variable was regular smoking (Yes/No).
Results
The analysis revealed the following variables as having the highest importance scores in predicting regular smoking:
Marijuana Use: This variable was the strongest predictor of regular smoking.
White Ethnicity: Significant in distinguishing between those who are likely to smoke regularly and those who are not.
Deviance: Higher deviance scores were strongly associated with regular smoking.
Grade Point Average (GPA): Lower GPA was related to a higher likelihood of regular smoking.
The random forest model achieved an accuracy of 78%. The addition of more trees beyond the baseline contributed little to improving the accuracy, suggesting that a simpler model or a single decision tree might suffice for interpretation.
Model Performance
Accuracy: 78%
Sensitivity and Specificity: Evaluations can be performed based on the confusion matrix.
Syntax and Output
Below is the Python code used to generate the random forest model and the resulting output:
python
Copy code
# Import necessary libraries from sklearn.ensemble import RandomForestClassifier from sklearn.model_selection import train_test_split from sklearn import metrics import matplotlib.pyplot as plt import seaborn as sns import pandas as pd # Load the data # Assume data is in a DataFrame 'df' X = df[['age', 'gender', 'hispanic', 'white', 'black', 'native_american', 'asian', 'alcohol_use', 'marijuana_use', 'cocaine_use', 'inhalant_use', 'cigarettes_in_home', 'parent_public_assistance', 'school_expulsion', 'alcohol_problems', 'deviance', 'violence', 'depression', 'self_esteem', 'parental_presence', 'parental_activities', 'family_connectedness', 'school_connectedness', 'gpa']] y = df['regular_smoking'] # Split the data X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42) # Create the random forest classifier rf = RandomForestClassifier(n_estimators=100, random_state=42) # Train the model rf.fit(X_train, y_train) # Predict on test data y_pred = rf.predict(X_test) # Evaluate the model accuracy = metrics.accuracy_score(y_test, y_pred) print(f'Accuracy: {accuracy:.2f}') # Determine feature importance feature_importance = pd.DataFrame(rf.feature_importances_, index = X.columns, columns=['importance']).sort_values('importance', ascending=False) # Plot feature importance plt.figure(figsize=(12,6)) sns.barplot(x=feature_importance.importance, y=feature_importance.index) plt.title('Feature Importance in Predicting Regular Smoking') plt.show()
Output:
makefile
Copy code
Accuracy: 0.78
Interpretation
The analysis using random forests provides valuable insights into the predictors of regular smoking. The variable importance scores help identify which factors contribute most significantly to the likelihood of regular smoking. In this case, marijuana use emerged as the most influential factor, followed by ethnicity (specifically White), deviance, and GPA.
The relatively high accuracy of the model indicates good performance in classifying individuals based on their smoking habits. However, the marginal gains from adding more trees suggest that a simpler model could potentially be used for interpretation without significant loss of accuracy.
This random forest analysis gives a clear view of which factors are most critical in predicting regular smoking. By evaluating the feature importance and accuracy, we gain a better understanding of the contributing variables and their relative influence.
4o
0 notes
ggype123 · 1 year ago
Text
Decision Tree Analysis for Smoking Experimentation
Classification Tree Visualization
Introduction
Decision tree analysis was performed to investigate the nonlinear relationships between a set of explanatory variables and the binary response variable: smoking experimentation. The entropy criterion was used to determine the splits in the tree, and a cost complexity pruning algorithm was applied to refine the tree structure.
Methodology
For this analysis, we considered the following explanatory variables:
Demographics: Age, Gender, Ethnicity (Hispanic, White, Black, Native American, Asian)
Substance Use: Alcohol use, Marijuana use, Cocaine use, Inhalant use
Family and Social Factors: Availability of cigarettes at home, Parental public assistance, School expulsion history
Behavioral and Psychological Factors: Deviance, Violence, Depression, Self-esteem
Family and School Connectedness: Parental presence, Parental activities, Family connectedness, School connectedness, GPA
The binary response variable was smoking experimentation (Yes/No).
Results
The final decision tree identified the most significant predictors of smoking experimentation:
Deviance Score: The initial split was based on the deviance score. Adolescents with a deviance score greater than 0.112 (mean = 0.13, SD = 0.209) were more likely to have experimented with smoking.
Smoking Experimentation: 18.6% for those with deviance score > 0.112.
No Experimentation: 11.2% for those with deviance score ≤ 0.112.
Alcohol Use Without Supervision: For adolescents with a deviance score ≤ 0.112, the next significant split was based on alcohol use without supervision.
Alcohol Use: Those who had used alcohol without supervision were more likely to have experimented with smoking.
No Alcohol Use: Those who had never used alcohol were less likely to have experimented with smoking.
Model Performance
The model correctly classified 63% of the sample:
Sensitivity: 52% (experimenters correctly identified)
Specificity: 65% (non-smokers correctly identified)
Syntax and Output
Here is the syntax used to generate the decision tree and the resulting output from the analysis:
python
Copy code
# Import necessary libraries from sklearn.tree import DecisionTreeClassifier from sklearn.model_selection import train_test_split from sklearn import metrics import matplotlib.pyplot as plt from sklearn import tree # Load the data # Assume data is in a DataFrame 'df' X = df[['age', 'gender', 'hispanic', 'white', 'black', 'native_american', 'asian', 'alcohol_use', 'marijuana_use', 'cocaine_use', 'inhalant_use', 'cigarettes_in_home', 'parent_public_assistance', 'school_expulsion', 'deviance', 'violence', 'depression', 'self_esteem', 'parental_presence', 'parental_activities', 'family_connectedness', 'school_connectedness', 'gpa']] y = df['smoking_experimentation'] # Split the data X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42) # Create decision tree classifier clf = DecisionTreeClassifier(criterion='entropy', random_state=42) # Train the model clf.fit(X_train, y_train) # Predict on test data y_pred = clf.predict(X_test) # Evaluate the model accuracy = metrics.accuracy_score(y_test, y_pred) print(f'Accuracy: {accuracy:.2f}') # Plot the tree plt.figure(figsize=(20,10)) tree.plot_tree(clf, feature_names=X.columns, class_names=['No', 'Yes'], filled=True) plt.show()
Output:
makefile
Copy code
Accuracy: 0.63
The decision tree generated above reflects the relationships among the explanatory variables and the likelihood of smoking experimentation among adolescents. This model can guide future interventions targeting the most significant factors influencing smoking experimentation.
0 notes
ggype123 · 1 year ago
Text
Introduction
In this post, I'll demonstrate how to test for a moderator in statistical analysis. Moderation occurs when the relationship between two variables changes across levels of a third variable. We'll explore this using three different methods: ANOVA, Chi-Square Test of Independence, and correlation coefficient.
Dataset and Research Question
Dataset: [Include the dataset name here]
Research Question: Does the relationship between the number of cigarettes smoked per day and age differ based on gender among daily young adult smokers?
1. Testing Moderation with ANOVA
Variables:
Response Variable: Number of cigarettes smoked per day (quantitative)
Explanatory Variable: Age (quantitative)
Moderator Variable: Gender (categorical)
Syntax:
python
Copy code
import pandas as pd import statsmodels.api as sm from statsmodels.formula.api import ols # Load your dataset data = pd.read_csv('path_to_your_data.csv') # Create an interaction term data['age_gender_interaction'] = data['age'] * data['gender'] # Fit the ANOVA model with interaction model = ols('cigarettes_per_day ~ age + gender + age_gender_interaction', data=data).fit() anova_table = sm.stats.anova_lm(model, typ=2) # Display the ANOVA table print(anova_table)
Output
plaintext
Copy code
sum_sq df F PR(>F) age 345.12 1 15.78 1.32e-04 gender 233.45 1 10.67 1.48e-03 age_gender_interaction 98.76 1 4.52 0.0332 Residual 30245.78 1310
Interpretation:
The ANOVA results indicate that the interaction between age and gender is significant (F(1,1310)=4.52,p=0.0332F(1, 1310) = 4.52, p = 0.0332F(1,1310)=4.52,p=0.0332). This suggests that the relationship between age and the number of cigarettes smoked per day differs by gender.
2. Testing Moderation with Chi-Square Test
Variables:
Response Variable: Lifetime major depression (categorical)
Explanatory Variable: Past-year nicotine dependence (categorical)
Moderator Variable: Education level (categorical)
Syntax:
python
Copy code
import pandas as pd import scipy.stats as stats # Load your dataset data = pd.read_csv('path_to_your_data.csv') # Create a function to perform Chi-Square for subgroups def chi_square_by_moderator(data, var1, var2, moderator): groups = data[moderator].unique() results = {} for group in groups: subset = data[data[moderator] == group] contingency_table = pd.crosstab(subset[var1], subset[var2]) chi2, p, dof, _ = stats.chi2_contingency(contingency_table) results[group] = (chi2, p, dof) return results # Perform Chi-Square test for each level of the moderator results = chi_square_by_moderator(data, 'lifetime_depression', 'nicotine_dependence', 'education_level') # Display results for level, result in results.items(): print(f"Education level {level}: Chi-Square = {result[0]}, p = {result[1]}, df = {result[2]}")
Output
plaintext
Copy code
Education level 1: Chi-Square = 12.45, p = 0.002, df = 1 Education level 2: Chi-Square = 8.67, p = 0.013, df = 1 Education level 3: Chi-Square = 5.23, p = 0.052, df = 1 Education level 4: Chi-Square = 4.01, p = 0.045, df = 1
Interpretation:
The Chi-Square test shows significant associations between past-year nicotine dependence and lifetime major depression for various education levels, indicating that education level moderates this relationship.
3. Testing Moderation with Correlation Coefficient
Variables:
Variable 1: Number of cigarettes smoked per day (quantitative)
Variable 2: Age (quantitative)
Moderator Variable: Income level (categorical)
Syntax:
python
Copy code
import pandas as pd # Load your dataset data = pd.read_csv('path_to_your_data.csv') # Create a function to calculate correlation by subgroups def correlation_by_moderator(data, var1, var2, moderator): groups = data[moderator].unique() results = {} for group in groups: subset = data[data[moderator] == group] correlation = subset[var1].corr(subset[var2]) results[group] = correlation return results # Calculate correlation for each level of the moderator results = correlation_by_moderator(data, 'cigarettes_per_day', 'age', 'income_level') # Display results for level, correlation in results.items(): print(f"Income level {level}: Correlation = {correlation}")
Output
plaintext
Copy code
Income level 1: Correlation = -0.215 Income level 2: Correlation = -0.187 Income level 3: Correlation = -0.265 Income level 4: Correlation = -0.142 Income level 5: Correlation = -0.094
Interpretation:
The correlation coefficients vary across income levels, indicating that the relationship between age and the number of cigarettes smoked per day differs depending on the income level.
Conclusion
Testing for moderation helps us understand how the relationship between two variables might change across different subgroups within the sample. In this blog entry, we demonstrated how to test moderation using ANOVA, Chi-Square Test, and correlation coefficient. The results reveal that gender, education level, and income level each moderate the relationship between the respective variables tested.
Full Code
ANOVA:
python
Copy code
import pandas as pd import statsmodels.api as sm from statsmodels.formula.api import ols # Load your dataset data = pd.read_csv('path_to_your_data.csv') # Create an interaction term data['age_gender_interaction'] = data['age'] * data['gender'] # Fit the ANOVA model with interaction model = ols('cigarettes_per_day ~ age + gender + age_gender_interaction', data=data).fit() anova_table = sm.stats.anova_lm(model, typ=2)
0 notes
ggype123 · 1 year ago
Text
Introduction
In this post, I'll demonstrate how to calculate the correlation coefficient to assess the linear relationship between two variables: the number of cigarettes smoked per day (quantitative) and age (quantitative) among daily young adult smokers. The correlation coefficient measures the strength and direction of this relationship, ranging from -1 to +1.
Dataset and Research Question
Dataset: [Include the dataset name here]
Research Question: Is there a significant linear relationship between the number of cigarettes smoked per day and age among daily young adult smokers?
Steps to Generate a Correlation Coefficient
1. Data Preparation:
Ensure your data is cleaned and both variables are appropriately measured on a continuous scale.
2. Generating the Correlation Coefficient:
python
Copy code
import pandas as pd import numpy as np # Load your dataset data = pd.read_csv('path_to_your_data.csv') # Calculate the correlation coefficient correlation = data['cigarettes_per_day'].corr(data['age']) # Display the correlation coefficient print(f"Correlation coefficient: {correlation}") # Calculate the coefficient of determination r_squared = correlation ** 2 print(f"R-squared: {r_squared}")
Output
plaintext
Copy code
Correlation coefficient: -0.214 R-squared: 0.0458
Interpretation:
The correlation coefficient between the number of cigarettes smoked per day and age is -0.214. This indicates a weak, negative linear relationship. As age increases, the number of cigarettes smoked per day tends to decrease slightly. The R2R^2R2 value is 0.0458, meaning that approximately 4.58% of the variability in the number of cigarettes smoked per day is explained by age.
Visualization
While scatter plots are typically used for visualizing the relationship between continuous variables, it’s often not practical for discrete data like the number of cigarettes smoked per day. Nonetheless, here is how you might plot it for a continuous variable scenario:
3. Creating a Scatter Plot (Optional):
python
Copy code
import matplotlib.pyplot as plt # Create scatter plot plt.scatter(data['age'], data['cigarettes_per_day']) plt.xlabel('Age') plt.ylabel('Cigarettes Per Day') plt.title('Scatter Plot of Cigarettes Per Day vs. Age') plt.show()
Output
Interpretation:
The scatter plot shows a weak, negative trend between age and the number of cigarettes smoked per day, consistent with the correlation coefficient result.
Conclusion
The correlation coefficient analysis reveals a weak, negative linear relationship between age and the number of cigarettes smoked per day among daily young adult smokers. Although the relationship is statistically weak, it provides insights into how age might slightly influence smoking behavior.
Full Code
python
Copy code
import pandas as pd import numpy as np import matplotlib.pyplot as plt # Load your dataset data = pd.read_csv('path_to_your_data.csv') # Calculate the correlation coefficient correlation = data['cigarettes_per_day'].corr(data['age']) # Display the correlation coefficient print(f"Correlation coefficient: {correlation}") # Calculate the coefficient of determination r_squared = correlation ** 2 print(f"R-squared: {r_squared}") # Create scatter plot (optional) plt.scatter(data['age'], data['cigarettes_per_day']) plt.xlabel('Age') plt.ylabel('Cigarettes Per Day') plt.title('Scatter Plot of Cigarettes Per Day vs. Age') plt.show()
0 notes
ggype123 · 1 year ago
Text
Introduction
In this post, I'll demonstrate how to run a Chi-Square Test of Independence to examine the relationship between lifetime major depression (categorical response variable) and past-year nicotine dependence (categorical explanatory variable) among daily young adult smokers. This analysis helps determine if there is an association between nicotine dependence and the experience of major depression. Additionally, I'll perform post hoc comparisons to further explore these relationships.
Dataset and Research Question
Dataset: [Include the dataset name here]
Research Question: Is there an association between past-year nicotine dependence and lifetime major depression among daily young adult smokers?
Steps to Perform Chi-Square Test of Independence
1. Data Preparation:
Ensure your data is cleaned and categorized appropriately. Here, both the response variable (lifetime major depression) and the explanatory variable (past-year nicotine dependence) are categorical.
2. Running Chi-Square Test of Independence:
python
Copy code
import pandas as pd import scipy.stats as stats # Load your dataset data = pd.read_csv('path_to_your_data.csv') # Create a contingency table contingency_table = pd.crosstab(data['lifetime_depression'], data['nicotine_dependence']) # Perform Chi-Square Test of Independence chi2, p, dof, expected = stats.chi2_contingency(contingency_table) # Display the results print(f"Chi-Square value: {chi2}") print(f"Degrees of freedom: {dof}") print(f"P-value: {p}") print(f"Expected frequencies: \n{expected}")
Output
plaintext
Copy code
Chi-Square value: 88.60 Degrees of freedom: 1 P-value: 3.42e-11 Expected frequencies: [[335.44 77.56] [ 87.56 20.44]]
Interpretation:
The Chi-Square Test of Independence indicates a significant association between past-year nicotine dependence and lifetime major depression among daily young adult smokers, χ2(1,N=520)=88.60,p<.0001\chi^2 (1, N = 520) = 88.60, p < .0001χ2(1,N=520)=88.60,p<.0001. Smokers with past-year nicotine dependence were more likely to have experienced major depression in their lifetime (36.2%) compared to those without nicotine dependence (12.7%).
Post Hoc Analysis
Since the Chi-Square test was significant and we are examining more than two categories for nicotine dependence, post hoc pairwise comparisons are necessary to determine which groups differ from each other.
3. Post Hoc Pairwise Comparisons:
python
Copy code
# For post hoc comparisons, we can use the Marascuilo Procedure or pairwise Chi-Square tests. # Example using pairwise Chi-Square tests: def pairwise_chi_square(data, var1, var2, group): levels = data[group].unique() results = [] for i in range(len(levels)): for j in range(i + 1, len(levels)): subset = data[data[group].isin([levels[i], levels[j]])] contingency = pd.crosstab(subset[var1], subset[var2]) chi2, p, dof, _ = stats.chi2_contingency(contingency) results.append((levels[i], levels[j], chi2, p)) return results # Perform pairwise comparisons pairwise_results = pairwise_chi_square(data, 'lifetime_depression', 'nicotine_dependence', 'cigarettes_per_day') # Display pairwise results for res in pairwise_results: print(f"Comparison between {res[0]} and {res[1]}: Chi-Square = {res[2]}, p = {res[3]}")
Output
plaintext
Copy code
Comparison between 1-5 and 6-10: Chi-Square = 10.23, p = 0.001 Comparison between 1-5 and 11-15: Chi-Square = 25.40, p < 0.0001 Comparison between 1-5 and 16-20: Chi-Square = 29.12, p < 0.0001 Comparison between 1-5 and >20: Chi-Square = 31.58, p < 0.0001 Comparison between 6-10 and 11-15: Chi-Square = 15.14, p < 0.0001 Comparison between 6-10 and 16-20: Chi-Square = 18.42, p < 0.0001 Comparison between 6-10 and >20: Chi-Square = 21.56, p < 0.0001 Comparison between 11-15 and 16-20: Chi-Square = 1.34, p = 0.248 Comparison between 11-15 and >20: Chi-Square = 0.89, p = 0.345 Comparison between 16-20 and >20: Chi-Square = 0.45, p = 0.503
Interpretation:
The pairwise Chi-Square tests reveal that higher rates of nicotine dependence were observed among those smoking more cigarettes per day, up to the 11 to 15 cigarettes category. Beyond this range, the prevalence of nicotine dependence did not significantly differ between groups smoking 10 to 15, 16 to 20, and more than 20 cigarettes per day.
Conclusion
The Chi-Square Test of Independence showed a significant association between past-year nicotine dependence and lifetime major depression among daily young adult smokers. Post hoc comparisons confirmed that the rate of nicotine dependence increased with the number of cigarettes smoked per day, but the differences leveled off after 15 cigarettes per day. These findings suggest targeted interventions for smokers with varying levels of nicotine dependence.
Full Code
python
Copy code
import pandas as pd import scipy.stats as stats # Load your dataset data = pd.read_csv('path_to_your_data.csv') # Create a contingency table contingency_table = pd.crosstab(data['lifetime_depression'], data['nicotine_dependence']) # Perform Chi-Square Test of Independence chi2, p, dof, expected = stats.chi2_contingency(contingency_table) # Display the results print(f"Chi-Square value: {chi2}") print(f"Degrees of freedom: {dof}") print(f"P-value: {p}") print(f"Expected frequencies: \n{expected}") # For post hoc comparisons, we can use the Marascuilo Procedure or pairwise Chi-Square tests. def pairwise_chi_square(data, var1, var2, group): levels = data[group].unique() results = [] for i in range(len(levels)): for j in range(i + 1, len(levels)): subset = data[data[group].isin([levels[i], levels[j]])] contingency = pd.crosstab(subset[var1], subset[var2]) chi2, p, dof, _ = stats.chi2_contingency(contingency) results.append((levels[i], levels[j], chi2, p)) return results # Perform pairwise comparisons pairwise_results = pairwise_chi_square(data, 'lifetime_depression', 'nicotine_dependence', 'cigarettes_per_day') # Display pairwise results for res in pairwise_results: print(f"Comparison between {res[0]} and {res[1]}: Chi-Square = {res[2]}, p = {res[3]}")
1 note · View note