Week Three: Preliminary Statistical Findings
Descriptive Statistics:
Table 1 below shows descriptive statistics for the categorical variables of interest in the sample. During the nine-month follow-up period, 59% of the sample was involved in at least one recidivism event.
Bivariate Analysis
Bar graphs of the association between the recidivism response variables and the CBM treatment predictor (Figure 1) show that every form of recidivism occurred at a lower rate when CBM was undertaken. However, chi-square tests of the association between treatment and recidivism rates indicate that none of these associations was statistically significant (see Table 2).
Figure 1: Association between CBM and rates of recidivism during 9-month follow-up period.
Table 2: Bivariate Analysis
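To make the bivariate test above concrete, here is a minimal Python sketch of a Pearson chi-square test of independence on a 2x2 treatment-by-recidivism table. The counts are hypothetical placeholders, not the study's data, and the function name is my own; the analysis reported here was run in SAS.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table (no continuity correction)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # With 1 degree of freedom the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts (NOT the study's data).
# Rows: CBM, standard parole. Columns: recidivated, did not recidivate.
hypothetical_table = [[120, 105], [135, 90]]
chi2, p_value = chi_square_2x2(hypothetical_table)
print(f"chi-square = {chi2:.3f}, p = {p_value:.3f}")
```

In this made-up table the CBM group recidivates less often, yet the p-value stays above .05, which is the same pattern described above: a lower observed rate that is not statistically significant.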
Multivariate Analysis – Lasso Regression Analysis.
Lasso regression analysis was used to identify the set of variables that most strongly predict recidivism among parolees. Thirty-three predictor variables associated with gender, number of minor children, education, past drug use, arrest history, and means of financial support were considered. Table 3 shows the 12 variables retained in the model selected by the lasso regression analysis. Number of minor children; unemployment-based financial support; and other opiate, hallucinogen, and amphetamine usage are associated with recidivism. Number of minor children, usage of other opiates and hallucinogens, and financial support from a partner or friend were positively associated with recidivism, whereas unemployment-based financial support and amphetamine usage were negatively associated.
Table 3: Lasso Least Angle Regression Selection Summary
Together, these 12 predictors accounted for only 12.8% of the variance in recidivism (R-square = .1279), suggesting that the model has very weak predictive value. Additionally, Figure 4 below shows that the prediction error of the lasso regression model on the training data (MSE = .25132) differed noticeably from its error on the test data (MSE = .21257). Of note, CBM treatment ranked 20th in variable importance.
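For readers unfamiliar with the two fit statistics quoted above, the following Python sketch shows how mean squared error and R-square are computed from observed and predicted values. The outcome and prediction lists are hypothetical; only the two MSE figures at the end are taken from this post.

```python
def mean_squared_error(actual, predicted):
    """Average squared difference between observed and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, predicted):
    """Share of outcome variance accounted for by the predictions."""
    mean_actual = sum(actual) / len(actual)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_residual / ss_total

# Hypothetical 0/1 recidivism outcomes and model predictions (illustration only).
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [0.7, 0.4, 0.6, 0.5, 0.5, 0.3, 0.8, 0.6]
mse = mean_squared_error(actual, predicted)
r2 = r_squared(actual, predicted)
print(f"MSE = {mse:.3f}, R-square = {r2:.3f}")

# Gap between the training and test MSE values reported in this post.
reported_gap = 0.25132 - 0.21257
print(f"training MSE minus test MSE = {reported_gap:.5f}")
```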
Challenges Faced in Undertaking Preliminary Analysis:
My original intention was to analyze the effectiveness of CBM treatment for a subset of the population: women with minor children. This subset, however, is not large enough to support the analysis, so I will rewrite my research question to reflect a more general interest in understanding recidivism across the whole sample.
I would also like to include race variables in my lasso regression analysis. The race data needed cleaning (missing data had not been recorded with a "0" value), so initial attempts at running the analysis produced errors. After I recoded the data and attempted to run LAR again, another error message appeared: "Selection aborted because there are no suitable observations for training." At this point, I have not been able to solve this issue. In addition, I am uncertain about the significance of the following statement, printed under the LAR selection summary table: "Selection stopped because all candidate effects for entry are linearly dependent on effects in the model."
Evaluating Collaborative Management Programs: Methods Section
Sample:
The sample included N = 450 drug-involved parolees who were randomly assigned to CBM or the usual parole system. The study was conducted from 2004 to 2008. The sample was drawn from parolees who were at least 18 years of age, English speaking, and, per drug screen results, likely drug dependent immediately prior to incarceration. Additional inclusion requirements were drug treatment recommended or mandated as part of parole conditions, and identification as a moderate to high risk of recidivism on the basis of the Lifestyle Criminality Screening Form (LCSF) or prior convictions. Persons with psychotic symptoms were excluded from the study.
Measures:
The primary analysis tests whether the CBM intervention decreases recidivism in comparison to standard parole. Recidivism was measured as any incidence of arrest, illegal drug use, or commission of a crime during the nine-month follow-up period. The secondary analysis considers whether moderating factors influence effectiveness, such as female gender and presence of minor children, history of drug usage, previous criminal record, and major form of economic support, among others. Data on the predictors were collected in intake interviews. Data on the recidivism response variables were collected through follow-up interviews with parolees, data collected at or by parole offices, and toxicology results from urine tests.
Analyses:
Initial primary analysis was conducted by calculating frequencies for participation in CBM or standard parole (the categorical predictor variable) and for the three categorical recidivism response variables. A chi-square test of independence was used to test the bivariate association between parolee treatment (CBM or standard parole) and each of the three types of recidivism (the response variables).
Initial secondary analysis repeated these steps only for women with minor children: frequencies were calculated for participation in CBM or standard parole and for the three categorical recidivism response variables, and a chi-square test of independence was used to test the bivariate association between parolee treatment and each type of recidivism.
A linear regression was used to test the association between parolee treatment and the three forms of recidivism for the whole sample and for the subsample of women with minor children. Bar graphs depict the relationship between frequency of recidivism and parolee treatment.
To further understand the importance of a series of predictor variables (gender, history of drug usage, previous criminal record, major form of economic support, and race, among others) in evaluating the potential effectiveness of CBM, a lasso regression analysis using k-fold cross-validation was performed. Data were randomly split into a training set that included 70% of the observations and a test set that included 30% of the observations. The least angle regression algorithm with k = 10 fold cross-validation was used to estimate the lasso regression model in the training set, and the model was validated using the test set. The change in the cross-validation mean squared error at each step was used to identify the best subset of predictor variables. All predictor variables were categorical. A coefficient progression graph was included to depict the relative importance of the various predictors and the direction of each association.
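The data-splitting scheme described above can be sketched in Python as follows. This only illustrates the 70/30 split and the 10 cross-validation folds; it does not fit a lasso model, and the function names and seed are my own assumptions rather than anything from the SAS procedure.

```python
import random

def train_test_split(indices, train_fraction=0.7, seed=123):
    """Shuffle the observation indices and cut them into training and test sets."""
    rng = random.Random(seed)  # the seed value is an arbitrary assumption
    shuffled = list(indices)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def k_folds(indices, k=10):
    """Partition the training indices into k validation folds of nearly equal size."""
    return [indices[i::k] for i in range(k)]

# N = 450 parolees, as stated in the Sample section.
train, test = train_test_split(range(450))
folds = k_folds(train, k=10)

for fold_number, validation in enumerate(folds, start=1):
    held_out = set(validation)
    fitting = [i for i in train if i not in held_out]
    # A model would be fit on `fitting` and scored on `validation` here.
    print(fold_number, len(fitting), len(validation))
```

Each training observation is held out exactly once, and the test set is never touched during cross-validation, so it remains available for the final validation step.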
Evaluating the Effectiveness of Collaborative Management Programs for Reducing Recidivism Among Drug-Involved Female Parolees with Minor Children.
Research Introduction: Substance use is widespread among persons convicted of crimes, prisoners, and parolees. Prisoners and parolees do not normally receive drug abuse treatment in prison or upon release, and recidivism among substance users is fairly high. To reduce recidivism, some jurisdictions have developed addiction treatment programs. These programs use various models of treatment and typically bring together parole officers and treatment counselors to collaboratively introduce and offer drug treatment and to motivate parolees to engage in it.
Focus Data Set: My research focuses on data collected by the Step n' Out study, a multi-site randomized behavioral trial conducted from 2004 to 2008. The Step n' Out study involved 450 drug-involved parolees who were randomly assigned to collaborative behavioral management or usual parole. Parolees were followed up at 3 months and 9 months to assess whether they had been rearrested and the status of their drug use. Step n' Out data collection efforts included: 1) intake interviews, which covered sociodemographic background, family and peer relations, health and psychological status, criminal involvement, drug use history, and HIV/AIDS risk behaviors; and 2) follow-up interviews specific to measuring criminal recidivism and drug relapse, which involved self-report and toxicology results from urine tests. Data were also collected from parole officers and treatment counselors to understand the character, structure, and process of collaboration that constitute the therapeutic alliance.
Research Question and Rationale: The purpose of the analysis is twofold: first, to analyze the comparative effectiveness of the six collaborative behavioral management programs for reducing drug use relapse and recidivism; and second, to examine the influence of moderating effects, such as participant characteristics and type of drug use. In particular, I am interested in understanding the comparative effectiveness of the program for women with minor children who have been involved in hard drugs. Other (hidden) variables of interest are homeless status, educational status, employment status, and form of economic support. If CBM programs are effective, their wider adoption could improve outcomes for offenders and families, and also strengthen collaborations between the criminal justice and addiction treatment systems.
References: Peter D. Friedmann, Elizabeth C. Katz, Anne G. Rhodes, Faye S. Taxman, Daniel J. O'Connell, Linda K. Frisman, William M. Burdon, Bennett W. Fletcher, Mark D. Litt, Jennifer Clarke & Steven S. Martin (2008). Collaborative Behavioral Management for Drug-Involved Parolees: Rationale and Design of the Step'n Out Study. Journal of Offender Rehabilitation, 47:3, 290-318. DOI: 10.1080/10509670802134184
Machine Learning Assignment #4 K-cluster analysis
The data I am working with are from the National Longitudinal Study of Adolescent Health (AddHealth), a school-based survey of adolescents in grades 7 through 12. I am interested in creating an intervention program to reduce violent behavior among students. I am running a k-means cluster analysis to identify characteristics of students that are associated with violent behavior. Violent behavior is measured in terms of involvement in physical fights.
The clustering variables I am analyzing are as follows: alcohol consumption (H1TO15); gun availability in home (H1TO53); truancy (H1ED2); expulsion from school (H1ED9); grade point average (H1ED11-14); classmate closeness (H1ED19); school connectedness (H1ED20); school happiness (H1ED22); maternal home presence (H1RM11 and H1RM12); paternal presence (H1RF11 and H1RF12); family understanding (H1PR5); freedom to decide own weekend curfew (H1WP1); witness of violence (H1FV1); thoughts of suicide (H1SU1); and likeliness one will go on to college (H1EE2). The validation variable is involvement in a serious physical fight (H1DS5).
Information regarding measurement of each variable is provided below:
Alcohol consumption (H1TO15) – 6-point scale, descending order
Gun availability in home (H1TO53) – binary categorical
Truancy (H1ED2) – quantitative (0-99)
Expulsion from school (H1ED9) – binary categorical
Grade point average (H1ED11-14) – 4-point scale, descending order
Classmate closeness (H1ED19) – 5-point scale, descending order
School connectedness (H1ED20) – 5-point scale, descending order
School happiness (H1ED22) – 5-point scale, descending order
Maternal and paternal home presence (H1RM11, H1RM12, H1RF11 and H1RF12) – 5-point scale, descending order
Family understanding (H1PR5) – 5-point scale, ascending order
Involvement in serious physical fight (H1DS5) – 4-point scale, ascending order
Freedom to decide own weekend curfew (H1WP1) – binary categorical
Witness of violence (H1FV1) – 3-point scale, ascending order
Thoughts of suicide (H1SU1) – binary categorical
Likeliness one will go on to college (H1EE2) – 5-point scale, ascending order
Printed below is the code, unique to my question:
libname mydata "/courses/d1406ae5ba27fe300" access=readonly;
data clust;
set mydata.addhealth_pds;
/* create a unique identifier to merge cluster assignment variable with
the main data set*/
idnum=_n_;
keep idnum H1TO15 H1TO53 H1ED2 H1ED9 H1ED11 H1ED12 H1ED13 H1ED14 H1ED19
H1ED20 H1ED22 H1RM11 H1RM12 H1RF11 H1RF12 H1PR5 H1WP1 H1FV1 H1SU1 H1EE2 H1DS5;
/* delete observations with missing data */
if cmiss(of _all_) then delete;
run;
ods graphics on;
/* Split data randomly into test and training data*/
proc surveyselect data=clust out=traintest seed = 123
samprate=0.7 method=srs outall;
run;
data clus_train;
set traintest;
if selected=1;
run;
data clus_test;
set traintest;
if selected=0;
run;
/*standardize the clustering variables to have a mean of 0 and standard deviation of 1*/
proc standard data=clus_train out=clustvar mean=0 std=1;
VAR H1TO15 H1TO53 H1ED2 H1ED9 H1ED11 H1ED12 H1ED13 H1ED14 H1ED19
H1ED20 H1ED22 H1RM11 H1RM12 H1RF11 H1RF12 H1PR5 H1WP1 H1FV1 H1SU1 H1EE2;
RUN;
%macro kmean(K);
proc fastclus data=clustvar out=outdata&K. outstat=cluststat&K. maxclusters= &K. maxiter=300;
VAR H1TO15 H1TO53 H1ED2 H1ED9 H1ED11 H1ED12 H1ED13 H1ED14 H1ED19
H1ED20 H1ED22 H1RM11 H1RM12 H1RF11 H1RF12 H1PR5 H1WP1 H1FV1 H1SU1 H1EE2;
run;
%mend kmean;
(Of note, one should run this code without the delete missing syntax if one has already cleaned the data for missing data.)
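As a rough illustration of what the standardize-then-cluster steps in the SAS code above do, here is a small pure-Python sketch that standardizes toy data, runs k-means for several values of K, and reports the R-square used for an elbow curve. The toy points, the function names, and the deterministic farthest-point seeding are my own assumptions; PROC FASTCLUS chooses its seeds differently, so this is not a reimplementation of it.

```python
import math

def standardize(rows):
    """Rescale each column to mean 0 and sample standard deviation 1."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    sds = [math.sqrt(sum((v - m) ** 2 for v in col) / (n - 1))
           for col, m in zip(columns, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, k, max_iter=300):
    """Lloyd's algorithm with farthest-point seeding (an assumption, not FASTCLUS)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        new_centers = [[sum(col) / len(c) for col in zip(*c)] if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:  # assignments stopped changing
            break
        centers = new_centers
    within_ss = sum(sq_dist(p, centers[i]) for i, c in enumerate(clusters) for p in c)
    return clusters, within_ss

# Toy data: three well-separated groups of three points (illustration only).
raw = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8], [1, 8], [1, 9], [2, 8]]
points = standardize(raw)
total_ss = sum(sq_dist(p, [0.0] * len(p)) for p in points)

r_square = {}
for k in range(1, 5):
    _, within_ss = k_means(points, k)
    r_square[k] = 1 - within_ss / total_ss
    print(k, round(r_square[k], 3))
```

On this toy data R-square jumps sharply up to K = 3 and then flattens, which is the bend an elbow curve is meant to reveal.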
The elbow curve graph for the nine cluster solutions is below:
The elbow curve suggests that either the 4-cluster or the 5-cluster solution may be optimal. A canonical discriminant analysis was used to reduce the 20 clustering variables to canonical variables so that the 4-cluster solution could be plotted. The resulting scatterplot is provided below:
The scatterplot indicates that the observations in clusters 3 and 4 were densely packed, had little within-cluster variance, and did not overlap much with other clusters. Clusters 1 and 2 both had high within-cluster variance, as suggested by their respective spreads. Based on the result of this initial 4-cluster plot, I would further evaluate the cluster solutions, probably by running a 2-cluster analysis.
The means table for the four-cluster solution shows how the clusters differ on the clustering variables. Because the data were standardized before the cluster analysis, the values are not on the original scales, and the variables should also be examined in their original units to understand their meaning. Once a variable is centered, a negative value means lower than average and a positive value means higher than average. Because some variables were scaled in ascending order and others in descending order, interpreting the sign requires knowing each variable's coding: a negative value on a descending-order variable means the opposite of a negative value on an ascending-order variable. Cluster 1 showed a high likelihood of gun availability at home; high rates of truancy and expulsion; low grades; moderate to low levels of classmate closeness, school connectedness, and school happiness; low rates of parental presence; freedom to decide one's own curfew; and thoughts of suicide. In contrast, cluster 4 showed a low likelihood of gun availability; low rates of truancy and expulsion; high grades; moderate to high levels of classmate closeness, school connectedness, and school happiness; low rates of parental presence; a low likelihood of deciding one's own curfew; and few thoughts of suicide. The results for alcohol consumption and likeliness of going to college were surprising. Alcohol consumption was low for cluster 1 and high for cluster 4. These surprising results call for returning to the data to check the accuracy of the interpretation.
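The point about scale direction can be shown with a short Python sketch: reverse-coding a descending scale flips the sign of every standardized value. The responses and the meaning attached to the scale points are hypothetical, and the helper names are my own.

```python
import statistics

def z_scores(values):
    """Standardize values to mean 0 and sample standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def reverse_code(values, low, high):
    """Flip a scale so that the lowest code becomes the highest and vice versa."""
    return [low + high - v for v in values]

# Hypothetical answers on a 5-point descending scale
# (assumed here: 1 = strongly agree ... 5 = strongly disagree).
responses = [1, 2, 2, 3, 5]
flipped = reverse_code(responses, low=1, high=5)

z_original = z_scores(responses)
z_flipped = z_scores(flipped)
for original, reversed_value in zip(z_original, z_flipped):
    print(f"{original:+.2f} {reversed_value:+.2f}")
```

A cluster mean of, say, -0.8 on the descending version therefore describes the same students as +0.8 on the reverse-coded version, which is why the sign alone cannot be read without the codebook.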
The box plot below shows involvement in physical fighting by cluster. It indicates that the prevalence of physical fighting differs across clusters, particularly when clusters 1 and 2 are compared with clusters 3 and 4.
To test whether the clusters differed significantly in violent behavior, an analysis of variance with Tukey post hoc comparisons was conducted. The results indicated significant differences between the clusters in involvement in physical fighting (425.22, p<.001). The largest significant differences in prevalence of physical fighting were between clusters 1 and 4, clusters 1 and 3, clusters 2 and 4, and clusters 2 and 3. Students in cluster 1 had the highest prevalence of fighting and those in cluster 4 the lowest. Note that because prevalence of fighting was measured not quantitatively but as a scaled categorical variable, there is no mean value.
Machine Learning for Data Analysis Week #3 Lasso Regression
Background: The data I am working with are from the National Longitudinal Study of Adolescent Health (AddHealth), a school-based survey of adolescents in grades 7 through 12. For this week's assignment, I wanted to consider the same relationships as I did for Week Two in order to compare the output of the random forest and the lasso regression. Due to the variable requirements, I am using the quantitative variable H1FV13 instead of the categorical variable H1FV5 as the response variable for violent behavior. For this blog post, I have included only the output and interpretation for the lasso regression.
Research Variable: I am interested in predicting violent behavior (H1FV13), understood as being involved in a physical fight. I used a lasso regression analysis to evaluate the importance of a series of explanatory variables: alcohol consumption (H1TO15WEEKLY), witness of violence in the form of a shooting or stabbing (H1FV1), truancy (H1ED2YES), parentally imposed weekend curfew (H1WP1), expulsion from school (H1ED9), out-of-school suspension (H1ED7), lack of feeling of social acceptance (H1PF35CATEGORY), parents care about self (H1PR3), formal education on how to handle conflict (H1TS14), grade point average (H1ED11-14), and attendance at church services (H1RE3). Additionally, the following demographic data were included as possible contributors: gender (BIO_SEX) and race/ethnicity, coded as Hispanic (H1GI4), White (H1GI6A), Black (H1GI6B), Native American (H1GI6C), and Asian/Pacific Islander (H1GI6D). Except for grade point average and attendance at church services, all variables are binary categorical.
The syntax used for lasso regression analysis is below. Data management steps included: recoding sex to be a binary 0/1 response and deleting observations with missing data.
Disclaimer: I should note that I am not happy with the outcome of this analysis. I tried several ways of adding the commands that delete missing data to the SAS code I have been building throughout this course. I consistently received error or warning messages: "ERROR 180-322: Statement is not valid or it is used out of proper order." and "WARNING: Data set WORK.TRAINTEST was not replaced because new file is incomplete." Finally, I omitted the step that deletes missing data and received a clean run. Although the output tables were identical across the various permutations, I do not have confidence in the output.
Output and interpretation:
The lasso regression used 2393 observations: 1667 were used for training and 726 for testing.
Of the 20 predictor variables, the GLMSELECT procedure results indicate that witness of violence in the form of a shooting or stabbing (H1FV1) is the most important predictor of involvement in violence. The other important predictors, in order of importance, are alcohol consumption (H1TO15WEEKLY), expulsion from school (H1ED9), and grade point average in English and arts (H1ED11). The CV PRESS column, however, shows that the model containing only witness of violence (H1FV1) is the best model for predicting involvement in violence; adding other variables (such as alcohol consumption or school expulsion) increases the cross-validated error.
The coefficient progression plot for violent behavior shows the relative importance of the predictor variables and the direction of each association. As the plot below indicates, witness of violence is the best predictor and constitutes the selected model. Adding variables increases error, as the CV PRESS plot shows. As indicated in the coefficient progression table, H1FV1 (witness of violence) is positively associated with involvement in violence.
The progression of average squared errors table indicates, as expected, that the model has a greater prediction error in the test data set than in the training data set. The patterns of error, however, are similar as variables are added.
Finally, the selected model table again indicates that the lasso regression selected the model at Step 1. The R-square value of 0.02 indicates that only 2% of the variability in involvement in a physical fight can be accounted for by witnessing violence. This suggests that the model has extremely weak predictive capability.
Machine Learning for Data Analysis - Week #2 Random Forests
Background: The data I am working with are from the National Longitudinal Study of Adolescent Health (AddHealth), a school-based survey of adolescents in grades 7 through 12.
Research Variable: I am interested in predicting violent behavior (H1FV5), understood as being in a physical fight. I used a random forest analysis to evaluate the importance of a series of explanatory variables: alcohol consumption (H1TO15WEEKLY), witness of violence in the form of a shooting or stabbing (H1FV1), truancy (H1ED2YES), parentally imposed weekend curfew (H1WP1), expulsion from school (H1ED9), out-of-school suspension (H1ED7), lack of feeling of social acceptance (H1PF35CATEGORY), parents care about self (H1PR3), formal education on how to handle conflict (H1TS14), grade point average (H1ED11-14), and attendance at church services (H1RE3). Additionally, the following demographic data were included as possible contributors: gender (BIO_SEX) and race/ethnicity, coded as Hispanic (H1GI4), White (H1GI6A), Black (H1GI6B), Native American (H1GI6C), and Asian/Pacific Islander (H1GI6D). Except for grade point average and attendance at church services, all variables are binary categorical. The following variables were not included in the decision tree analysis (week #1): formal education on how to handle conflict, grade point average, and attendance at church services (H1RE3).
The syntax used for random forest analysis is below. Note that all data was cleaned for missing data and the binary response variable was recoded as 2= no, per instructions. (Cleaning and recoding data is not included below.)
PROC HPFOREST;
target H1FV5/level=nominal;
input H1TO15WEEKLY H1FV1 H1ED2YES H1WP1 H1ED9 H1ED7 H1PF35CATEGORY H1PR3 H1TS14 BIO_SEX H1GI4 H1GI6A H1GI6B H1GI6C H1GI6D /level=nominal;
input H1ED11 H1ED12 H1ED13 H1ED14 H1RE3/level=interval;
RUN;
The random forest analysis used the Gini index as the split criterion. By default, it grew 100 trees and selected 60% of the sample for bagging. Of 6504 observations read, 6455 were used. The baseline fit statistics table tells us that the forest correctly classified 68% of the sample (1 - .317). (See tables below.)
The fit statistics table (below) indicates that by 16 trees the out-of-bag (OOB) misclassification rate has leveled off at approximately 30%. This is not a considerable improvement over the single decision tree's error of 34%, suggesting that interpreting a single decision tree may be appropriate.
The out-of-bag error estimates in the loss reduction variable importance table tell us that the most important variables, in order, are: out-of-school suspension (H1ED7), having seen a shooting or stabbing (H1FV1), gender/sex (BIO_SEX), expulsion from school (H1ED9), truancy (H1ED2YES), Native American/Asian (H1GI6C-D), alcohol consumption (H1TO15WEEKLY), lack of feeling of social acceptance (H1PF35CATEGORY), and Hispanic (H1GI4). Note that the random forest does not specify whether the association between any of these explanatory variables and the response variable is direct or inverse.
None of the new variables included in this analysis (formal education on how to handle conflict, grade point average, and church attendance) appeared as important for predicting violent behavior. Both the random forest and the single decision tree still suffer from a high error rate and a limited ability to predict violent behavior, which suggests to me that there may be important explanatory variables that I have not yet included in my modeling.
Machine Learning Data Analysis - Assignment #1 Running a Classification Tree
Background: The data I am working with are from the National Longitudinal Study of Adolescent Health (AddHealth), a school-based survey of adolescents in grades 7 through 12.
Research Variable: The following explanatory variables were included as possible contributors to a classification tree model evaluating recent violent behavior (my response variable): recent alcohol consumption (H1TO15WEEKLY), recent witness of violence in the form of a shooting or stabbing (H1FV1), recent truancy (H1ED2YES), parentally imposed weekend curfew (H1WP1), ever expelled from school (H1ED9), ever had an out-of-school suspension (H1ED7), lack of feeling of social acceptance (H1PF35CATEGORY), and parents care about self (H1PR3). Additionally, the following demographic data were included as possible contributors: gender (BIO_SEX) and race/ethnicity, coded as Hispanic (H1GI4), White (H1GI6A), Black (H1GI6B), Native American (H1GI6C), and Asian/Pacific Islander (H1GI6D). All of the variables are categorical. (It should be noted that the survey uses a variety of time frames; fighting within the past 12 months is asked as both variable H1DS5 and variable H1FV5. For BIO_SEX, 1 = male and 2 = female.)
A decision tree analysis was performed to test nonlinear relationships among a series of explanatory variables and a binary, categorical response variable (violent behavior). All possible separations (categorical) were tested. For the present analyses, the entropy “goodness of split” criterion was used to grow the tree and a cost complexity algorithm was used for pruning the full tree into a final subtree.
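To show what an entropy "goodness of split" criterion measures, here is a brief Python sketch that computes the entropy of a node and the information gain of a candidate binary split. The counts are hypothetical and the function names are my own; this illustrates the criterion in general, not the internals of PROC HPSPLIT.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a node, given its class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Entropy of the parent node minus the size-weighted entropy of its children."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical node: [fought, did not fight] counts, split on some yes/no predictor.
parent = [40, 60]
children = [[30, 10], [10, 50]]
gain = information_gain(parent, children)
print(f"parent entropy = {entropy(parent):.3f} bits, gain = {gain:.3f} bits")
```

A tree grown with this criterion picks, at each node, the split with the largest gain; cost-complexity pruning then removes splits whose gain does not justify the added leaves.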
The syntax used for generating the decision tree is below. Note that all data was cleaned for missing data and the binary response variable was recoded as 2= no, per instructions. (Cleaning and recoding data is not included below.)
ODS GRAPHICS ON;
PROC HPSPLIT seed=12345;
CLASS H1FV5 H1TO15WEEKLY H1FV1 H1ED2YES H1WP1 H1ED9 H1ED7 H1PF35CATEGORY H1PR3YES BIO_SEX H1GI4 H1GI6A H1GI6B H1GI6C H1GI6D;
MODEL H1FV5= H1TO15WEEKLY H1FV1 H1ED2YES H1WP1 H1ED9 H1ED7 H1PF35CATEGORY
H1PR3YES BIO_SEX H1GI4 H1GI6A H1GI6B H1GI6C H1GI6D;
grow entropy;
prune costcomplexity;
RUN;
The tree had 305 leaves before pruning and 8 after. Of 6508 observations, 6083 were used to generate the final tree. See the model information table below.
Find the final tree generated by the program and associated diagram depicted below:
The ever out of school suspension (H1ED7) was the first variable to separate the sample into two subgroups. Of the 1631 adolescents who had an out of school suspension 45% had been in a physical fight in the past 12 months. Of the 4,452 adolescents who had never had an out of school suspension,
The other splits are on gender (BIO_SEX) and witness of a stabbing or shooting (H1FV1), and each of these variables appears twice. There are six terminal nodes representing six subgroups. Three subgroups show high rates of violence (indicated by a high yes, "1", value). The highest of these is the subgroup that had experienced an out-of-school suspension and witnessed a stabbing or shooting: of the 368 respondents in this category, 82% had been in a physical fight in the past 12 months. Another group of interest is boys who had never had a suspension but had witnessed a stabbing or shooting in the past 12 months: 160 respondents fit in this group, and 63% had been in a physical fight in the past 12 months.
The model-based confusion matrix tells us how well the final classification tree performed. The model correctly classifies 32% of those who have been in a physical fight within the past 12 months (1-.67) and 93% of those who have not (1-.0715). The model is thus better at predicting those who are protected from violent behavior than those who are at risk for it.
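The two percentages above are the model's sensitivity and specificity. A minimal Python sketch of how they are derived from a confusion matrix follows; the counts are hypothetical values chosen only to give rates of similar size, and are not the tree's actual confusion matrix.

```python
def classification_rates(tp, fn, fp, tn):
    """Return sensitivity, specificity, and overall accuracy from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # share of actual "yes" cases classified as "yes"
    specificity = tn / (tn + fp)  # share of actual "no" cases classified as "no"
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return sensitivity, specificity, accuracy

# Hypothetical counts (illustration only): 100 respondents who fought, 200 who did not.
sensitivity, specificity, accuracy = classification_rates(tp=32, fn=68, fp=14, tn=186)
print(f"sensitivity = {sensitivity:.2f}")
print(f"specificity = {specificity:.2f}")
print(f"accuracy    = {accuracy:.2f}")
```

Overall accuracy can look acceptable even when sensitivity is poor, because the larger "no" group dominates the total, so the two rates are worth reporting separately.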
The variable importance table below tells us that race (Hispanic) and truancy are also important variables but that they are masked by gender.
Regression Modeling in Practice: Week 4 Assignment Logistical Modeling.
RESEARCH VARIABLES: DESCRIPTION AND MANAGEMENT
The research question for this assignment concerns possible explanatory variables for violent behavior, measured as being in a physical fight (H1FV5YES). The explanatory variables of interest are alcohol consumption (H1TO15WEEKLY), witness of violence in the form of a shooting or stabbing (H1FV1), truancy (H1ED2YES), parentally imposed weekend curfew (H1WP1), and lack of feeling of social acceptance (H1PF35CATEGORY). The data for alcohol consumption, physical fight, witness of violence, and truancy all pertain to the last 12 months and were based on self-reports. Alcohol consumption and physical fight were recorded as ranked categorical variables. During data management, both were binned into binary categorical variables: alcohol consumption as weekly or greater consumption versus less, and physical fight as yes or no. Truancy was recorded as a quantitative variable and was recoded as a binary categorical variable (yes or no). Feeling of social acceptance was recorded as a ranked categorical variable and was recoded as "does not feel socially accepted" (yes or no).
LOGISTIC REGRESSION: SINGLE VARIABLE RESULTS
Weekly alcohol consumption and physical fighting were significantly associated (p<.0001; OR = 2.256, 95% CI = 1.909-2.665).
Witness of violence and physical fighting were significantly associated (p<.0001; OR = 5.641, 95% CI = 4.806-6.62).
Truancy and physical fighting were significantly associated (p<.0001; OR = 2.093, 95% CI = 1.866-2.347).
Feelings of lack of social acceptance and physical fighting were not significantly associated (p=.0592).
Parental curfew and physical fighting were not significantly associated (p=.4675).
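As a quick consistency check on the odds ratios and intervals listed above, the Python sketch below recovers the point estimate and log-scale standard error implied by each reported interval. It assumes the intervals are Wald intervals, which are symmetric on the log-odds scale; that is my assumption about how they were produced, and the function name is my own.

```python
import math

def recover_from_ci(lower, upper, z=1.96):
    """Point estimate and log-scale standard error implied by a Wald confidence interval."""
    log_lower, log_upper = math.log(lower), math.log(upper)
    odds_ratio = math.exp((log_lower + log_upper) / 2)  # geometric midpoint
    standard_error = (log_upper - log_lower) / (2 * z)
    return odds_ratio, standard_error

# (reported OR, lower bound, upper bound) as listed in this post.
reported = {
    "weekly alcohol consumption": (2.256, 1.909, 2.665),
    "witness of violence": (5.641, 4.806, 6.62),
    "truancy": (2.093, 1.866, 2.347),
}

implied = {}
for name, (odds_ratio, lower, upper) in reported.items():
    implied[name], se = recover_from_ci(lower, upper)
    print(f"{name}: reported OR {odds_ratio}, implied OR {implied[name]:.3f}, SE {se:.3f}")
```

The implied estimates agree with the reported odds ratios to within rounding, so each interval is centered on its odds ratio on the log scale, as a Wald interval should be.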
LOGISTIC REGRESSION: MULTIPLE VARIABLES
Witnessing of violence and alcohol consumption are independently associated with involvement of physical fighting (p<.0001 for both variables). The witness of violence does not confound the association between alcohol consumption and involvement in physical fighting. The OR estimates tell us that respondents in our sample who consumed alcohol on a weekly or more frequent basis were 1.9 times more likely to be involved in a physical fight that those who did not drink to that extent after controlling for the witnessing of a shooting or stabbing. Also, respondents who had witness a shooting or stabbing were approximately 5.4 times more likely to be involved in a physical fight that those who had not witness such behavior, after controlling weekly or greater alcohol consumption. Moreover, because the confidence intervals do not overlap, we can also say that witnessing a shooting or stabbing is more heavily associated with involvement in physical fighting than weekly or greater alcohol consumption. Extending our findings to the general population, we can say that youth who consume alcohol on a weekly or greater basis are 1.6 to 2.3 times more likely to be in a physical fight than those who do not drink to that extent (and after controlling for the witnessing of a shooting or stabbing). Additionally, youth who have witnessed a shooting or stabbing are 4.6 -6.3 times more likely to be in a physical fight than those how have not witnessed a shooting or stabbing (and after accounting for alcohol consumption.)
Adding truancy to the logistic regression, we see that truancy, witnessing of shooting/stabbing, and weekly or greater alcohol consumption are all independently associated with involvement in physical fighting (p<.0001 for all variables while controlling for the influence of the other variables). That is, the logistic regression did not indicate that any of the three variables had a confounding effect.
The OR estimates tell us that respondents in our sample who consumed alcohol on a weekly or more frequent basis were 1.6 times more likely to be involved in a physical fight than those who did not drink to that extent, after controlling for the witnessing of a shooting or stabbing and truancy. Also, respondents who had witnessed a shooting or stabbing were approximately 5 times more likely to be involved in a physical fight than those who had not witnessed such behavior, after controlling for weekly or greater alcohol consumption and truancy. Finally, respondents who had been truant from school were 1.7 times more likely to be involved in a physical fight than those who had not been truant, after controlling for weekly or greater alcohol consumption and witnessing of a stabbing or shooting.
Overlapping confidence intervals for alcohol consumption and truancy mean that we cannot say that truancy is more strongly associated with physical fighting than alcohol consumption. Because its confidence interval does not overlap with the other two, we can say that the witnessing of a stabbing or shooting is more strongly associated with involvement in physical fighting than the other variables. Extending our findings to the general population, we can say that youth who consume alcohol on a weekly or greater basis are 1.3 to 1.9 times more likely to be in a physical fight than those who do not drink to that extent (after controlling for the witnessing of a shooting or stabbing and truancy). Additionally, youth who have witnessed a shooting or stabbing are 4.2 to 5.9 times more likely to be in a physical fight than those who have not witnessed a shooting or stabbing (after controlling for alcohol consumption and truancy). And finally, youth who are truant are 1.5 to 1.9 times more likely to be involved in physical fighting than those who are not truant (after controlling for alcohol consumption and witnessing of a shooting or stabbing).
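The interval comparison described above can be written out as a small check. This is only a sketch in Python (not the SAS used for the analysis), using the rounded confidence limits quoted in the text; the dictionary and function names are my own.

```python
# 95% confidence limits for the adjusted odds ratios quoted in the text above.
# Variable and function names are illustrative only.
adjusted_ci = {
    "alcohol": (1.3, 1.9),
    "witnessed violence": (4.2, 5.9),
    "truancy": (1.5, 1.9),
}

def intervals_overlap(a, b):
    """Return True when two (lower, upper) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

pairs = [("alcohol", "truancy"),
         ("alcohol", "witnessed violence"),
         ("truancy", "witnessed violence")]
for first, second in pairs:
    overlap = intervals_overlap(adjusted_ci[first], adjusted_ci[second])
    print(f"{first} vs {second}: overlap = {overlap}")
```

Non-overlap is a conservative rule of thumb: it supports saying one association is stronger, whereas overlapping intervals only mean the comparison is inconclusive, not that the two odds ratios are equal.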
The results of the logistic regression support the association between the primary explanatory variable (alcohol consumption) and the response variable (involvement in physical fighting). The results, however, also show that witnessing of a stabbing or shooting has a stronger association.
Regression Modeling in Practice Week #3 Assignment
RESEARCH VARIABLES: CHARACTER AND ASSOCIATED SURVEY QUESTION
The research variables for this week’s assignment are alcohol usage (explanatory variable) and violent behavior (response variable). Both variables were measured through self-report and pertain to the respondents’ last 12 months. The explanatory variable H1TO16 (table 28) is reported as a quantitative variable. Adolescents were asked about the number of alcoholic drinks consumed on average each time they drank. The response variable H1FV13 (table 31) was reported as a quantitative variable. Respondents were asked “During the past 12 months, how many times were you in a physical fight in which you were injured and had to be treated by a doctor or nurse?”
Confounding variables of possible interest include delinquency (measured as school truancy), sexual activity, and witnessing violent behavior. Confounding variables were chosen that closely adhere to the same time frame as the primary explanatory and response variables; that is, respondents were asked questions related to the previous 12 months or year. Truancy variable H1ED2 (table 5) was reported as a quantitative variable. Respondents were asked how many times they had skipped school for a full day without an excuse. The sexual activity variable H1NR7 (table 26) was reported as a quantitative variable. Respondents were asked, “Since January 1, 1994, with how many people in total have you had a sexual relationship?” The witnessing violent behavior variable H1FV1 (table 31) was reported as a categorical variable. Respondents were asked, “During the past 12 months, how often have you seen someone shoot or stab another person?” with a response scale of never (0), once (1), more than once (2). This confounding variable was collapsed to yes or no.
MULTIPLE REGRESSION ANALYSIS FINDINGS
The multiple regression analysis with the explanatory variable of quantity of alcohol (H1TO16_c) and confounding variables of truancy (H1ED2_c), sexual relationships (H1NR7_c), and witnessing of violence (H1FV1YES) when parsed out shows the following:
1) sexual activity and alcohol consumption are both significantly and positively associated with incidents of serious physical fighting – the former has a p value of <.0001 and parameter estimate of .0537 and the latter has a p value of .0021 and a parameter estimate of .0286.
2) Truancy rate and witnessing violence however are not significantly associated (they have p values of .3207 and .3935 respectively).
These multiple regression results indicate that sexual activity is a confounding variable for explaining the incidence of physical fighting.
Looking only at the multiple regression results for sexual activity and alcohol consumption, depicted below, we see an intercept value of .43, which tells us that at the mean values for alcohol consumption and sexual relationships, we would expect the incidents of physical fighting to be .43. Additionally, the R square value of .095 means that sexual activity and alcohol consumption together account for 9.5% of the variability in the incidence of physical violence. This low percentage suggests that there is a more compelling explanatory variable for the incidence of physical violence that has not yet been identified, or that the relationship is not linear.
TESTING FOR MISSPECIFICATION
To assess the likely existence of model specification error, I produced residual plots for the model with alcohol consumption and sexual activity. I do not include the other two variables because it has already been determined that they are not significantly associated with the response variable.
The resulting Q-Q plot, depicted below, shows significant deviation from a straight line. This tells us that our current linear equation of 2 variables does not provide a good fit for the observed incidents of physical violence and suggests that there may be other explanatory variables.
The standardized residuals for all observations, depicted below, show that there are no extreme outliers, i.e., no observations more than three standard deviations away from the mean, that may be affecting the fit.
The outlier and leverage plot depicted below shows us that we have only a few observations that are both outliers and unduly influencing (leveraging) the model estimation. In short, error is probably only minimally due to outliers having leverage.
Regression Modeling in Practice - Assignment Week #2
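The program code used is as follows:
Step #1:
/* Collapse alcohol frequency: 1 = weekly or more often, 0 = less than weekly */
IF H1TO15<=3 THEN H1TO15WEEKLY=1;
ELSE IF H1TO15 IN (4, 5, 6, 7, 97) THEN H1TO15WEEKLY=0;
Step #2:
/* Frequency table to check the coding */
PROC FREQ; TABLES H1TO15WEEKLY;
RUN;
Step #3:
/* Linear regression with a binary explanatory variable */
PROC GLM; MODEL H1FV5=H1TO15WEEKLY/SOLUTION;
RUN;
Step #4:
/* Bar graph of mean fighting frequency by drinking group */
PROC GCHART; VBAR H1TO15WEEKLY/DISCRETE TYPE=MEAN SUMVAR=H1FV5;
RUN;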
Output table and graph are as follows:
Regression Modeling in Practice - Assignment Week #2
Linear regression model results are as follows:
The F statistic and p value for the association are 41.82 and <.0001 respectively, which tells us that frequency of involvement in physical fights is associated with alcohol consumption among high school age students. The output table arrives at a linear regression formula of: violent behavior = .48 + .31 (alcohol usage). The association is thus positive. We would expect high school students who drink alcohol on a less than weekly basis to get into a physical fight .48 times a year. We would expect those high school students who drink on a weekly or greater basis to get into a physical fight .79 times a year (.48 + .31). The R square value of .014764 tells us that only about 1.5% of the variability in violent behavior (being involved in a fight) can be accounted for by drinking alcohol.
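With a 0/1 explanatory variable, the fitted equation quoted above yields only two predicted values: the intercept for the group coded 0, and the intercept plus the slope for the group coded 1. A minimal Python sketch of that arithmetic follows (the analysis itself was run in SAS; the constant and function names here are illustrative).

```python
# Coefficients from the fitted equation quoted above:
# violent behavior = .48 + .31 * (alcohol usage)
INTERCEPT = 0.48
SLOPE = 0.31

def predicted_fights(weekly_drinker):
    """Expected value of the fighting variable for H1TO15WEEKLY = 0 or 1."""
    return INTERCEPT + SLOPE * weekly_drinker

for code, label in [(0, "less than weekly"), (1, "weekly or more")]:
    print(f"{label}: {predicted_fights(code):.2f}")
```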
Regression Modeling in Practice - Assignment Week #2
In this week’s assignment, I am running a linear regression to test the association between alcohol usage (explanatory variable) and violent behavior (response variable). My research question is: is drinking weekly or more frequently associated with an increased frequency of getting into physical fights? Both variables are ranked categorical variables. To fulfill the assignment, I did the following steps to the data:
1) collapse the explanatory variable (alcohol usage) to two levels: weekly or greater alcohol usage and less than weekly usage (H1TO15WEEKLY) and code the explanatory variable such that 0=less than weekly usage and 1=weekly or greater usage;
2) create a frequency table to check coding;
3) run the linear regression model using syntax suitable for a binary categorical explanatory variable;
4) generate a bar graph to depict the relationship between frequency of violent behavior and alcohol usage.
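Steps 1 and 2 can be sketched as follows. This is a Python illustration rather than the SAS actually used, and it makes one assumption that is not in the assignment: any code outside the listed response values is left uncoded. The sample responses are hypothetical and exist only to show the frequency check.

```python
from collections import Counter

def recode_weekly(h1to15):
    """Collapse the H1TO15 frequency code to 1 (weekly or more) or 0 (less often)."""
    if h1to15 in (1, 2, 3):          # every day ... 1-2 days a week
        return 1
    if h1to15 in (4, 5, 6, 7, 97):   # monthly or less, never, or legitimate skip
        return 0
    return None                      # any other code is left uncoded (assumption)

# Hypothetical responses, not taken from the Add Health data.
sample_codes = [1, 3, 4, 7, 97, 96]
frequency_table = Counter(recode_weekly(code) for code in sample_codes)
print(frequency_table)
```

Checking the counts in the frequency table against the original response categories is what confirms the collapse was coded as intended.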
The frequency table results for step #2 are pasted below:
Study Sample, Data Collection Procedures, and Research Variables
Step 1 (Study Sample):
The sample I am using is from the first wave of the National Longitudinal Study of Adolescent Health (Add Health). 20,745 adolescents attending 7th to 12th grade participated in Wave I of the Add Health school administered questionnaire. The data sample for this analysis included 6504 participants/observations who provided self-reports for alcohol usage and violent behavior.
Step 2 (Data Collection Procedures):
Add Health is the largest and most comprehensive longitudinal school-based survey of adolescents ever undertaken in the United States. The in-school survey was administered nationally to a representative sample of 7th to 12th graders in 1994. Five subsequent in-home interview surveys were conducted in 1995, 1996, 2001-2, 2008, and currently 2016-8. The longitudinal approach has enabled researchers to study the experiences and behaviors of adolescents as they transition into adulthood. The sample for this study is taken from the Wave 1 survey, which focuses on factors that may influence adolescents’ health and risk behaviors, including personal traits, families, friendships, romantic relationships, peer groups, schools, neighborhoods, and communities. The Add Health sample design involved a sample of 80 high schools and 52 middle schools throughout the United States, wherein participants had an unequal probability of selection. The sampling design was intended to ensure that the sample was representative of US schools with regard to region, degree of urbanization, school size and type, and ethnic composition.
Data was collected by fieldworkers trained and managed by the National Opinion Research Center of the University of Chicago. Sections 24-33 of the Wave 1 interview were administered using computer-assisted personal interviews (CAPI). The response rate for Wave 1 was 79%.
Step 3 (Research Variables):
The research variables for this study are alcohol usage (explanatory variable) and violent behavior (response variable). Both variables were measured through self-report, pertain to the respondents’ last 12 months, and are reported as categorical variables. Adolescents were asked about the frequency of alcohol usage with the response scale as follows: every day or almost every day; 3-5 days a week; 1-2 days a week; 2-3 days a month; once a month or less; 1-2 days in the last 12 months; and never in the last 12 months. Respondents were also asked if/how many times they have been in a physical fight in the last 12 months. The response scale was as follows: never, once, more than once.
Confounding variables of possible interest include having witnessed violent behavior (seeing a stabbing or shooting), carrying a weapon to school, sexual activity, and with whom one fought (stranger, friend, girlfriend, family member).
To date, the data has been cleaned and undergone preliminary descriptive and bivariate analysis (Chi square test of Independence).
Data Analysis Tools - Assignment Week #4
I am using the AddHealth data base, which includes data on youth attitudes and behavior. I would like to know the relationship between alcohol and fighting behavior. My research question is: What is the relationship between alcohol usage (H1TO15) and number of times one has been in a physical fight (H1FV5), with the moderating variable being having seen someone shoot or stab another person (H1FV1)? All questions pertain to the respondent’s last 12 months. H1TO15 (alcohol usage) is the explanatory variable and it is categorical. H1FV5 (fighting behavior) is the response variable and is also categorical. The moderating variable H1FV1 (seen a shooting or stabbing) is categorical as well. The null hypothesis is that there is no relationship between alcohol usage and involvement in fighting.
A Chi Square Test of Independence for the two variables showed a significant association (Chi square value of 90.6196, p<.0001). The associated frequency table and bar graph (below) show that the number of respondents who have been in a physical fight within the last 12 months generally decreases with lower frequencies of alcohol consumption (with the exception of never in the last 12 months). In other words, a higher frequency of alcohol consumption is associated with physical fighting. NB: In the above graph, the alcohol usage scale is in decreasing order and as follows: 1= every day or almost every day; 2= 3-5 days a week; 3= 1-2 days a week; 4= 2-3 days a month; 5= once a month or less; 6= 1-2 days in the last 12 months; 7= never in the last 12 months; 97= less than 3 times in one’s life (skip). The SAS code used to run the Chi Square Test of Independence and for the bar graph was as follows:
/* Chi Square analysis of alcohol usage and physical fight */
PROC FREQ; TABLES H1FV5YES*H1TO15/CHISQ; RUN;
/* Bivariate graph for categorical variables */
PROC GCHART; VBAR H1TO15/DISCRETE TYPE=MEAN SUMVAR=H1FV5YES; RUN;
In regard to the introduction of the moderating variable “witnessing a stabbing or shooting within the last 12 months”:
A Chi Square Test of Independence found that alcohol usage and fighting behavior were significantly associated, with a X2 value of 47.1888 and p<.0001, in the case of respondents who had not seen a shooting or stabbing within the last 12 months. There was, however, no association in the case of respondents who had seen a shooting or stabbing once within the last 12 months (X2 value of 7.5846 and p=.3706) or more than once within the last 12 months (X2 value of 7.7226 and p=.3577). Thus the witnessing of a shooting or stabbing does moderate the influence of alcohol on physical fighting.
The SAS code used for running the Chi Square Tests with the moderating variable was as follows:
/* testing effect of moderating variable*/ PROC SORT; by H1FV1; PROC FREQ; TABLES H1FV5YES*H1TO15/CHISQ; BY H1FV1; RUN;
The output tables follow below.
Data Analysis Tools - Week #3 Assignment
I am using the AddHealth data base, which includes data on youth attitudes and behavior. I would like to know the relationship between drug usage and sexual behavior. The variables I am working with are as follows:
Lifetime usage (During your life, how many times have you used X drug?) of marijuana (H1TO31), cocaine (H1TO35), inhalants (glue or solvents) (H1TO38), and other illegal drugs (LSD, PCP, ecstasy, mushrooms, speed, ice, heroin, or pills, without a doctor’s prescription) (H1TO41).
Number of sex partners (How many people have you ever had sexual relationships with?) (H1NR6).
My null hypothesis is that there is no relationship between number of times respondents use various drugs and the number of sexual partners they have had.
Below are four scatter plots depicting the relationship between number of sexual partners and lifetime quantity of consumption of (1) marijuana, (2) cocaine, (3) inhalants, and (4) other illegal drugs. The SAS code used for scatterplot is as follows:
PROC GPLOT; Plot H1NR6*H1TO31;
PROC GPLOT; Plot H1NR6*H1TO35;
PROC GPLOT; Plot H1NR6*H1TO38;
PROC GPLOT; Plot H1NR6*H1TO41;
The scatterplots are not very useful because of the scale at which they are rendered. They would be more informative if the axis ranges were narrower; the wide ranges also suggest that the survey questions need refinement. Unfortunately, we have not yet been taught whether or how the scale of the plotted data can be altered.
Following below is the table for Pearson correlation test for the four drug types. The SAS code used for the Pearson correlation is as follows:
PROC CORR; VAR H1TO31 H1TO35 H1TO38 H1TO41 H1NR6; RUN;
The Pearson correlation value r for association between:
lifetime marijuana usage (explanatory variable) and lifetime number of sexual partners (response variable) is .076 with a p value of .059.
lifetime cocaine usage (explanatory variable) and lifetime number of sexual partners (response variable) is .157 with a p value of .245.
lifetime inhalant usage (explanatory variable) and lifetime number of sexual partners (response variable) is .05 with a p value of .562.
lifetime other illegal drug usage (explanatory variable) and lifetime number of sexual partners (response variable) is -.0066 with a p value of .9383.
All of the r values are close to zero, indicating a very weak linear relationship between the explanatory and response variables. Additionally, all of the p values are greater than .05, so we cannot reject the null hypothesis of no (linear) relationship between drug usage and number of sexual partners. Again, because of the scale of the scatterplots, it is difficult to assess whether there might be a non-linear (curvilinear) relationship between any of the variables.
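One way to see how weak these correlations are is to square each r, which gives the fraction of variability in number of sexual partners that a linear relationship with each drug variable would account for. A short Python sketch using the r values reported above (the names are illustrative; the correlations themselves came from SAS PROC CORR):

```python
# Pearson r values reported above, each against number of sexual partners.
reported_r = {
    "marijuana": 0.076,
    "cocaine": 0.157,
    "inhalants": 0.05,
    "other illegal drugs": -0.0066,
}

def variance_explained(r):
    """r squared: the share of variability accounted for by a linear fit."""
    return r ** 2

for drug, r in reported_r.items():
    print(f"{drug}: r = {r}, r squared = {variance_explained(r):.4f}")
```

Even the largest correlation here (cocaine) corresponds to less than 3% of the variability.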
Data Analysis Tools - Week #2 Assignment.
Working with the National Longitudinal Study of Adolescent Health (AddHealth) data base, the hypothesis that I developed in the precursor to this course related to the association between frequency of participation in three levels of physical activity (explanatory variable) and frequency of self-reported negative feelings during the last week (response variable).
For this week’s assignment, I ran Chi Square tests for the association between participation in active sports (defined as baseball, softball, basketball, soccer, swimming, and football) as the categorical explanatory variable (H1DA5) and self-reports of “feeling depressed” (H1FS6) as the categorical response variable. Specifically, I am interested in understanding whether how frequently respondents report participating in active sports is related to reports of feeling depressed.
The explanatory variable has 4 levels (not at all, 1 or 2 times, 3 or 4 times, and 5 or more times of exercise during the past week). The response variable also initially had 4 levels. These, however, were collapsed to two levels (“never/rarely/sometimes” and “a lot of/most/all the time”) (H1FS6CATEGORY).
I was required to run a X2 and post hoc analysis for 6 paired comparisons. The SAS code I used was as follows:
A Chi Square test of independence revealed that frequency of participation in active sports during the past week and self-reported negative feelings were significantly associated, X2 = 54.77, 3 df, p<.0001. The p value is less than .05, suggesting that the null hypothesis should be rejected. However, because the explanatory variable has more than two categories, I am required to do a post hoc analysis to avoid a Type I error.
The Bonferroni adjusted p-value in my case of a categorical explanatory variable with 4 levels (6 paired comparisons) is .008. If the p value is less than .008 for any of the six paired comparisons, there is significant evidence against the null hypothesis of no association, and we can therefore state that there is an association between the two variables for that comparison.
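The .008 threshold comes from dividing the usual .05 level by the number of paired comparisons among the 4 levels. A minimal Python sketch of that arithmetic (the function name is illustrative; the post hoc tests themselves were run in SAS):

```python
from math import comb

def bonferroni_threshold(levels, alpha=0.05):
    """Number of pairwise comparisons and the Bonferroni adjusted p-value cutoff."""
    comparisons = comb(levels, 2)   # every pair of category levels
    return comparisons, alpha / comparisons

comparisons, threshold = bonferroni_threshold(4)
print(f"{comparisons} comparisons, adjusted threshold = {threshold:.3f}")
```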
The p-values for the paired comparisons are indicated in the table below.
Post hoc comparisons of frequency of self-reported negative feelings by participation in active sports revealed the following:
There are not significantly different self-reports of negative feelings between those participants who do not participate in sports at all and those who participate 1-2 times a week.
There are significantly different self-reports of negative feelings between those participants who do not participate in sports at all and those that participate 3 times or more weekly.
There are significantly different self-reports of negative feelings between those participants who participate in sports 1-2 times a week and 3-4 times a week.
There are significantly different self-reports of negative feelings between those participants who participate in sports 1-2 times a week and 5 or more times a week.
There are not significantly different self-reports of negative feelings between those participants who participate in sports 3-4 times a week and 5 or more times a week.
To be sure that I understood the nature of the associations (direct or inverse) between participation and self-reported negative feelings, I referenced a bar graph (see below). Unlike the bar graph used in class to depict the association between cigarette consumption and nicotine dependence, my bar graph shows that the relationship is inverse. That is, where there is a significant association between the two variables, participation in active sports is associated with lower reports of negative feelings.
Data Analysis Tools - Assignment Week #1
The hypothesis that I developed in the precursor to this course related to the association between frequency of participation in three levels of physical activity and frequency of self-reported negative feelings, during the last week. These variables were measured as categorical variables.
This assignment, running an analysis of variance, requires a response variable that is quantitative. Most of the fields of enquiry in the AddHealth data base are categorical. Quantitative data is, however, included for questions regarding the number of times in a physical fight during the past 12 months (H1FV13), the number of times used marijuana during the past 30 days (H1TO31), and the number of days of the past seven when at least one parent was in the room while you were eating your evening meal (H1WP8).
I have decided to run the ANOVA for the association between frequency of the self-reported negative feeling “thought life has been a failure” (categorical explanatory variable H1FS9) and number of times used marijuana during the past 30 days (quantitative response variable H1TO31). (I recognize that ideally the association would measure the same time frame.) There are four levels to the categorical explanatory variable: 0 - never or rarely; 1 - sometimes; 2 - a lot of the time; 3 - most or all of the time.
Before running the analysis, I managed the data for the new variable, ran a frequency table, and ran univariate and bivariate analyses (as I had for all variables previously). The code I used for running ANOVA is:
PROC ANOVA; CLASS H1FS9;
MODEL H1TO31= H1FS9;
MEANS H1FS9;
RUN;
When examining the association between the frequency of the self-reported negative feeling “thought life had been a failure” and marijuana usage during the past 30 days, ANOVA revealed that amongst those who never or rarely felt life had been a failure, the mean number of times marijuana was used was 39 (SD = 99). At the other end of the continuum, amongst those who felt their life had been a failure most or all of the time, the mean number of times marijuana was used was 26 (SD = 41). (Because the SD is larger than the mean, the mean minus one SD is negative, which is impossible for a count; this indicates that the data are heavily skewed and not normally distributed.) Looking at the table, the differences between the group means are small relative to the standard deviations, which also suggests that the null hypothesis will not be rejected.
The ANOVA test arrives at an F-value of .38 and a p value of .7710. Since the p-value is not less than .05, we cannot reject the null hypothesis. Thus, we find no evidence of a relationship between the self-reported negative feeling of “thought life was a failure” and marijuana usage. As the result was not statistically significant, we do not need to do a post hoc test.