higgerz-blog
The SAS Experience
higgerz-blog · 8 years ago
Machine learning for data analysis: Assignment 4
The codebook I’m using for this assignment is the Addhealth codebook. I will be doing a k-means cluster analysis relating alcohol consumption (drinking) to seven explanatory variables ranging from fist fights to depression, suicide, and cigarette smoking. The exact definitions of the variables are given in the first block of labeling in the code. The code that generates my data is shown below:
Tumblr media Tumblr media Tumblr media
I will discuss the relevant results from my code below; the full output is given at the end of the blog post. First, let’s take a look at how the R-square value depends on the number of clusters:
Tumblr media
The gain in R-square seems to drop off at around 7 clusters. Although there is some strange behavior after that point, the safest choice is the 7-cluster solution. The means for the 7-cluster set are as follows:
Tumblr media
We can see that there are a few different types of clusters. Clusters 1 and 3 have all of the variables positive, and we will call these individuals with self-destructive behaviors, while cluster 7 has all of the variables negative. Clusters 2, 4, 5, and 6 have a mix of positive and negative variables, with every variable taking a negative value in at least one cluster. How these values actually cluster is shown in the graph below:
Tumblr media
Of the 7 clusters, 4, 5, and 7 are the most densely packed, whereas clusters 1, 2, 3, and 6 are fairly spread out. There is a fair amount of overlap between clusters 7 and 5, and the data belonging to clusters 2 and 4, as well as 1 and 6, lie very close together. The results of the ANOVA test relating these clusters to the DRINKING variable (have you been drinking in the past 12 months) are shown below.
Tumblr media
As with our previous table, the group with the most self-destructive behavior is cluster 1. This cluster actually has one of the lowest means of DRINKING, aside from cluster 7, which has all negative values. But keeping all variables positive while lowering the mean of suicide attempts actually results in more drinking. The other clusters are close together in mean, and not much meaningful information can be drawn from them.
For those interested, the complete output is shown below:
Tumblr media Tumblr media Tumblr media Tumblr media Tumblr media Tumblr media Tumblr media Tumblr media Tumblr media Tumblr media Tumblr media Tumblr media Tumblr media Tumblr media Tumblr media
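The code for this post survives only as images, so the following is a minimal sketch, under stated assumptions, of how a k-means analysis like the one described could be set up in SAS. The data set name `new`, the macro variable `clustvars`, and the output data set names are hypothetical. Only four of the seven clustering variables are named in these posts, so the list is deliberately incomplete.

```sas
/* Assumed input: data set NEW containing DRINKING and the binary
   clustering variables. The variable list is illustrative only. */
%let clustvars = FIGHTS DEPRESS FIFTYSTEAL ATTEMPTS1;

/* Standardize to mean 0 and standard deviation 1, so cluster means
   read as above (+) or below (-) the overall mean */
proc standard data=new out=clustvar mean=0 std=1;
   var &clustvars;
run;

/* k-means with 7 clusters; OUT= adds CLUSTER and DISTANCE to each row */
proc fastclus data=clustvar out=clustered maxclusters=7 maxiter=300;
   var &clustvars;
run;

/* One-way ANOVA of DRINKING across cluster membership */
proc anova data=clustered;
   class cluster;
   model DRINKING = cluster;
   means cluster / tukey;
run;
quit;
```

Rerunning the FASTCLUS step with different `maxclusters=` values and comparing the overall R-square is one way to rebuild the R-square versus number-of-clusters plot.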
higgerz-blog · 8 years ago
Machine learning for data analysis: Assignment 3
The codebook I’m using for this assignment is the Addhealth codebook. I will be doing a lasso regression relating alcohol consumption (drinking) to seven explanatory variables ranging from fist fights to depression and suicide. The exact definitions of the variables are given in the first block of labeling in the code. Values are all set to binary for convenience. The code that generates my data is shown below:
Tumblr media Tumblr media
The relevant output of the code is shown below:
Tumblr media
We can see that by the end of the selection the model has improved by only about 0.5% (the variables used are not that relevant to the model). Being in a serious fist fight was not really relevant either, as that variable would only enter after the minimum CV PRESS.
Tumblr media
We can see from the above plots that all of our variables are actually negatively correlated with our response variable DRINKING, with being suspended from school the most significant and stealing something worth more than $50 a close second. The CV PRESS decreases smoothly and reaches its minimum at the point where suicide attempts (ATTEMPTS1) is added.
Tumblr media
This shows the fit criteria for DRINKING. Using any metric other than CV PRESS, we would add the FIGHTS variable and get a better model, but that is not the case with CV PRESS.
Tumblr media
We also see a smooth decrease in the squared errors up to step 6, where the variable ATTEMPTS1 is added.
Tumblr media
The final part of the output shows our model only actually corrected for 6 out of the 4400 training samples, suggesting we may need more variables to get an adequate model. Our R-square value is also only 0.05, so the model is a pretty poor fit to the data. In the last table, we also see confirmation that all of our variables are indeed negatively correlated with our response variable drinking.
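Because the code images are missing, here is a hedged sketch of a lasso selection with cross-validated PRESS in SAS. The data set name `new`, the 30% test fraction, the seed, and the incomplete predictor list are all assumptions for illustration and are not taken from the original code.

```sas
/* Illustrative, incomplete predictor list (assumption) */
%let predictors = FIGHTS DEPRESS FIFTYSTEAL ATTEMPTS1;

proc glmselect data=new plots=all seed=123;
   /* Hold out part of the data as a test set (fraction is an assumption) */
   partition fraction(test=0.3);
   /* Lasso path; keep the step with the minimum 10-fold CV PRESS */
   model DRINKING = &predictors
         / selection=lasso(choose=cv stop=none) cvmethod=random(10);
run;
```

With `stop=none` the procedure walks the whole selection path, which is why a variable such as FIGHTS can appear in the plots even though it enters after the CV PRESS minimum.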
higgerz-blog · 8 years ago
Machine learning for data analysis: Assignment 2
The codebook I’m using for this assignment is the Addhealth codebook. I will generate a random forest relating alcohol consumption (drinking) to seven explanatory variables ranging from fist fights to depression and suicide. The exact questions behind the variables are given in the first block of labeling in the code. Values are all set to binary for convenience. The code that generates my data is shown below:
Tumblr media Tumblr media
The output for the random forest operation is divided into three parts:
Tumblr media
We can see here that the misclassification rate is 0.288 which means 71.2% of the data was classified correctly, which is pretty decent for this model. 
Tumblr media Tumblr media
Based on the tree generation, we can say that the model has clearly converged as the tree increased its leaves. There has not been much improvement over the baseline, as we have only gone from 28.8% misclassification to 28.2% out-of-bag misclassification by using the random forest.
Tumblr media
This table shows the relative importance of the variables. We can see that being suspended from school has the highest association with drinking, with stealing second, depression third, and threatening and curfew after that. Suicide attempts and serious fist fights have very low importance to this model.
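As a rough sketch of the kind of call that produces this output, assuming PROC HPFOREST is available in the SAS installation and that the data set is named `new` (both assumptions), a random forest on binary predictors could look like this. Only four of the seven predictors are named in these posts, so the input list is incomplete.

```sas
proc hpforest data=new;
   /* Binary response treated as a nominal target */
   target DRINKING / level=nominal;
   /* Binary explanatory variables (illustrative, incomplete list) */
   input FIGHTS DEPRESS FIFTYSTEAL ATTEMPTS1 / level=nominal;
run;
```

The procedure reports a baseline fit, an iteration history with out-of-bag misclassification, and a variable importance table, which matches the three parts of output discussed above.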
higgerz-blog · 8 years ago
Machine learning for data analysis: Assignment 1
I am working with the Addhealth data set, and today I am doing a decision tree analysis on adolescent drinking. The 4 variables whose association with drinking I tested are “stealing something over fifty dollars” (FIFTYSTEAL), “being in a serious fist fight” (FIGHTS), feelings of depression (DEPRESS), and “attempting suicide” (ATTEMPTS1). All categorical explanatory variables were binned into two categories, with 0 meaning the person had not done the activity at all and 1 meaning they had done it at some level. For the categorical response variable, the response “no” was coded as 2, as per the video instructions. The code used to generate the results is shown below:
Tumblr media Tumblr media
The results of the analysis are shown below. First, I show the decision tree.
Tumblr media
The first major split for people who drink alcohol is whether or not they stole something worth more than $50; there is a high weight on the “no” answer to this fork. The highest percentage of people who drink alcohol are the people who have “no” for FIFTYSTEAL and “no” for feelings of depression. Of the people who have “no” for FIFTYSTEAL, the lowest percentage that drink alcohol (1-0.67 = 0.33) is for people who haven’t attempted suicide. The highest percentage that drink alcohol (1-0.51 = 0.49) is where they have not stolen, are depressed, and have attempted suicide. The confusion table is presented below:
Tumblr media
The error rate for predicting “yes” (or 1) for alcohol drinking is 90% (10% of the data is predicted correctly), but for “no” (or 2), 97% of the data is predicted correctly. This brings the total error rate to (1663+152)/6424 = 0.28, so 28% of the data is misclassified and 72% is predicted accurately by the decision tree.
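Written out as a worked calculation, using only the two misclassified counts and the total quoted above, the overall error rate and accuracy are:

```latex
\[
\text{error rate} = \frac{1663 + 152}{6424} = \frac{1815}{6424} \approx 0.283,
\qquad
\text{accuracy} = 1 - 0.283 \approx 0.717
\]
```

So roughly 28% of observations are misclassified and about 72% are classified correctly.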
Tumblr media
The ROC plot shows that this model is actually not that accurate, with an AUC value of 0.59 and a curve that does not really approach a right angle; it could be improved with some different explanatory variables.
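Since the code itself is only available as images, the following is a minimal sketch of a decision tree fit in SAS using the four variables named in this post. The data set name `new`, the seed, and the grow and prune choices are assumptions and may differ from what was actually run.

```sas
ods graphics on;

proc hpsplit data=new seed=15531;
   /* All variables are categorical, including the response */
   class DRINKING FIFTYSTEAL FIGHTS DEPRESS ATTEMPTS1;
   model DRINKING = FIFTYSTEAL FIGHTS DEPRESS ATTEMPTS1;
   /* Split on entropy, then prune by cost complexity */
   grow entropy;
   prune costcomplexity;
run;
```

With ODS graphics on, the procedure draws the tree and the ROC curve, and prints the confusion matrix discussed above.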
higgerz-blog · 8 years ago
Regression modeling in practice: Assignment 4
I am using the Addhealth codebook, and in this assignment I will be analyzing the relationship between the explanatory variables “feelings of depression” and “serious thoughts of suicide” and the response “being in a serious physical fist fight”. The variables are recoded into new binary variables. For example, for the feelings of depression variable, if a person never experienced feelings of depression it was coded as 0 in a new variable; if they experienced moderate to severe levels, it was coded as 1. A similar procedure follows for fist fights. “Serious thoughts of suicide” (H1SU1) is already in binary format after unuseful values are set to missing. My hypothesis is that neither feelings of depression nor serious thoughts of suicide are correlated with fist fights. The code that produces our logistic regression results is shown below:
Tumblr media
The results of our logistic regression are shown below:
Tumblr media Tumblr media
As we can see, both the depression variable (beta=0.3025, p<0.0001) and thoughts of suicide (beta=0.4020, p<0.0001) are associated with fist fights, contrary to my original hypothesis. The intercept of the model is given as (beta=-0.9317, p<0.0001). Since both variables remain statistically significant (p<0.0001) when included together, neither appears to confound the other in this model.
The odds ratios tell the same story: a person who experiences depression has 1.2-1.5 times the odds of getting into a serious physical fight, and people who have thoughts of suicide have 1.3-1.75 times the odds.
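The odds ratios follow from the fitted coefficients, since in logistic regression the odds ratio for a binary predictor is the exponential of its beta. As a quick check using the two betas reported here:

```latex
\[
\mathrm{OR} = e^{\beta}, \qquad
e^{0.3025} \approx 1.35, \qquad
e^{0.4020} \approx 1.49
\]
```

Both point estimates fall inside the ranges quoted above (1.2-1.5 for depression and 1.3-1.75 for thoughts of suicide).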
higgerz-blog · 8 years ago
Regression modeling in practice: Assignment 3
The codebook I am currently using is the Addhealth codebook, and for this linear regression assignment I am interested in the relationship between the explanatory variable “feelings of depression” and the response variable “number of cigarettes smoked per day in the past 30 days”. I am also analyzing the possible confounders “have you been in a serious physical fight” and “have you ever stolen something worth over $50”. All explanatory variables were recoded so that only binary values remained. For example, 0 for the fight variable would be “never been in a fight”, and 1 would be “having been in one or more fights”. My hypothesis is that one of the variables, FIGHTS or FIFTYSTEAL, confounds the relationship between depression and cigarettes smoked per day in the past month. The code to produce all of the data is shown below:
Tumblr media Tumblr media
The results of the linear regression are shown below:
Tumblr media
All values will be reported as their 95% confidence intervals. The fit values for the variables are FIGHTS (beta=0.673-1.213, p<0.0001), DEPRESS (beta=0.629-1.139, p<0.0001), and FIFTYSTEAL (beta=2.063-3.212, p<0.0001). The intercept, the number of cigarettes smoked per day in the last 30 days when all explanatory variables are 0, is 0.830-1.180. The R^2 value for this fit is 0.03, which shows that only 3% of the variability in the response is explained by these variables. With all 3 p-values being less than 0.0001, we can say that each explanatory variable, both our variable of interest and the possible confounders, is significantly associated with our response variable “number of cigarettes smoked per day in the past 30 days”, although the low R^2 means the association is very weak. Because DEPRESS remains significant with the other variables in the model, neither FIGHTS nor FIFTYSTEAL confounds the relationship between depression and cigarette smoking, which is the opposite of my hypothesis.
The Q-Q plot is shown below:
Tumblr media
This plot shows that at the lower and higher percentiles of our model the residuals start to blow up, becoming extremely high at the far end. This suggests that there might be higher-order (quadratic) behavior in one of our explanatory variables that we could investigate, or that our explanatory variables are just not a good fit for the response variable. The latter seems more likely.
The standardized residual plot is shown below:
Tumblr media
The standardized residuals are quite large, some even reaching 17 standard deviations away from the mean. There is a large cluster around 0, but it’s safe to say that more than 5% are beyond 2 standard deviations and more than 1% are beyond 3 standard deviations, so these 3 explanatory variables are just not a good model fit for this response variable.
The leverage plot is shown below:
Tumblr media
By this leverage plot we can see that we actually have more outliers than not. There are many red points (outliers) and many brown points (outliers with high leverage). With this many outliers, especially ones with high influence or leverage that we cannot safely remove, the validity of this fit to the response variable is very much in question. Based on this and the Q-Q and standardized residual plots above, I believe these variables are a very poor model for this response variable.
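For reference, a multiple regression with confidence limits and the diagnostic plots discussed in this post can be requested roughly as follows. This is a sketch, not the original code: the data set name `new`, the output data set `results`, and the residual variable names are assumptions.

```sas
proc glm data=new plots(unpack)=diagnostics;
   /* SOLUTION prints the parameter estimates; CLPARM adds 95% limits */
   model H1TO7 = DEPRESS FIGHTS FIFTYSTEAL / solution clparm;
   /* Save raw and studentized residuals for further inspection */
   output out=results residual=res student=stdres;
run;
quit;
```

The unpacked diagnostics panel includes the Q-Q plot, the standardized residual plot, and the leverage plot as separate graphs.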
higgerz-blog · 8 years ago
Regression modeling in practice: Assignment 2
I am currently using the Addhealth codebook, and for this assignment I am analyzing the relationship between “feelings of depression” (H1FS6), our explanatory variable, and “number of cigarettes smoked per day for the last 30 days” (H1TO7), our quantitative response variable. The code used to produce the data for this assignment is shown below:
Tumblr media
The explanatory variable has 4 levels so it is consolidated into a new variable DEPRESS with 2 categories. DEPRESS = 0 means never or rarely experienced depression, and DEPRESS = 1 means that the individual has experienced depression “sometimes” all the way to “all of the time”. The response variable has the legitimate skips (meaning hasn’t smoked in the past 30 days) reset to 0 since this is the same as just 0 cigarettes for the context of our question. Unuseful responses for both such as “refused” or “don’t know” have been marked as missing. The results of running the code are shown below
Tumblr media
The frequency table shows the share of people who almost never experience depression as opposed to those who experience some mild or severe form. We can see that this has been coded correctly, as the former group is 0 and the latter is 1. 61.60% are in the “never or rarely” category and 39.40% are in the “sometimes” or greater category.
The analysis yields an R^2 value of 0.009354 and a p-value of <0.0001. The result is statistically significant, but the very low R^2 means the relationship between “feelings of depression” and “number of cigarettes smoked” is extremely weak. The y = mx + b equation we get from the data is:
(cigarettes smoked per day in the past 30 days) = 1.019 * (“feelings of depression”) + 1.382. The p-values for both of these coefficients are also <0.0001.
But as I mentioned before, this relationship is not really useful since the R^2 is so low. We cannot reasonably use it to predict cigarette smoking from feelings of depression.
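Because DEPRESS only takes the values 0 and 1, the fitted line reduces to two predicted means. Plugging both values into the equation above gives:

```latex
\[
\hat{y}(\text{DEPRESS}=0) = 1.019 \cdot 0 + 1.382 = 1.382,
\qquad
\hat{y}(\text{DEPRESS}=1) = 1.019 \cdot 1 + 1.382 = 2.401
\]
```

In other words, the slope of 1.019 is simply the difference in predicted cigarettes per day between the two depression groups.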
higgerz-blog · 8 years ago
Regression modeling in practice: Assignment 1
 Sample: 
The sample I am using is the Addhealth dataset, a national longitudinal survey of a representative sample of adolescents in grades 7-12 during the year 1995. Specifically, I am focusing on the surveys conducted with 6,504 adolescents. 80 high schools across the United States were randomly selected from a sample of 26,666, with a sufficiently broad distribution of size, school type, census region, level of urbanization, and percent white. Specific groups within this population were studied, not the individual or the aggregate.
The specific relationship I am looking at within the data set is the relationship between “feelings of depression” and “number of attempted suicides in the past 12 months”.
Methods: 
The Add Health study originated from a mandate by the U.S. Congress to fund an adolescent health study. In-home interviews (conducted in person, not through a written questionnaire) were performed at many households around the US in 1995. Parental consent was granted, and the interviews were conducted as computer-assisted personal interviews, where answers to interview questions were typed into a computer. For more sensitive questions, the question was recorded and played to the interviewee, who typed the answer into the computer themselves to increase the reliability of answers. Questions regarding health, family, friends, substance use, sex, feelings, educational expectations, and more were asked during the interviews.
Measures: 
Questions on “feelings of depression in the past week” and “number of attempted suicides in the past 12 months” were studied as a part of the Addhealth in-home interviews. Feelings of depression, my explanatory variable, had 6 response categories: never or rarely, sometimes, a lot of the time, most of the time or all the time, refused, and don’t know. The refused and don’t know responses were marked as missing data. For number of attempted suicides, my response variable, the responses were: 0 times, 0 times (legitimate skip, did not contemplate suicide), 1 time, 2 or 3 times, 4-5 times, 6 or more times, refused, and don’t know. As with the previous variable, the refused and don’t know options were omitted. For the analysis, I binned “number of attempted suicides in the past 12 months” into two categories: 0 attempted suicides, and 1 or more attempted suicides.
higgerz-blog · 8 years ago
Data Analysis Tools Assignment 4: Moderator Test
I am using the Addhealth codebook and am interested in the relationship between “feelings of depression” (H1FS6) and “attempted suicides” (H1SU2), with the moderating variable being “alcohol use in the last 12 months” (H1TO15). This moderator variable is already categorical, so it does not need to be categorized before proceeding. I think alcohol abuse may cause more attempted suicides to occur, but I would like to test whether this is the case. The code that derives these results is shown below:
Tumblr media Tumblr media
Note that the explanatory variable is condensed into two categories in a separate variable DEPRESS, with 0 being never experienced depression and 1 being having experienced some level of depression, mild or severe. The attempted suicides variable has also been compressed into a binary response variable, with 0 being never attempted suicide and 1 being attempted suicide at least once. The results of the moderator test for the 7 different levels of alcohol use (shown in the proc format statement “number_of_times_one”) are shown below:
Tumblr media Tumblr media Tumblr media Tumblr media Tumblr media Tumblr media Tumblr media
This is a lot of data, but it breaks down as follows. The first alcohol level, 1, is “every day or almost every day”. Looking at the column percentages over the different levels of depression, of those not feeling depressed 86.96% did not attempt suicide and 13.04% did. Of those feeling depressed, a slightly lower percentage did not attempt (80.56%) and a slightly higher percentage did (19.44%). These aren’t very different, which is reflected in the high p-value. However, there is a very small amount of data, 59 entries, which triggers a warning and makes me somewhat skeptical of the overall implications of the result. The rest of the alcohol levels are statistically significant and thus show an association between depression and attempted suicide, with generally decreasing p-values and increasing chi-square values (with the exception of level 4 to level 5) across the moderator variable. These tests show that for the highest level of alcohol use the link between depression and attempted suicide may not be present, but it is difficult to say with so little data. For all other alcohol levels, down to no drinking at all, there is an association between depression and attempted suicide.
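A moderator test of this kind amounts to running the same chi-square test once per level of the third variable. As a minimal sketch, assuming the data set is named `new` and that DEPRESS and ATTEMPTS1 have already been created as described (the original code is only available as images):

```sas
/* BY-group processing requires the data to be sorted by the moderator */
proc sort data=new;
   by H1TO15;
run;

/* One chi-square test of ATTEMPTS1 (rows) by DEPRESS (columns)
   for each level of alcohol use */
proc freq data=new;
   tables ATTEMPTS1*DEPRESS / chisq;
   by H1TO15;
run;
```

SAS prints a warning under the chi-square table when expected cell counts are small, which is the warning mentioned for the level with only 59 entries.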
higgerz-blog · 8 years ago
Data Analysis Tools Assignment 3: Pearson Correlation
My actual study from the Addhealth codebook uses two categorical variables, so I selected two quantitative ones to practice using the Pearson correlation. The two variables I selected were H1NM3, which is “how old were you when your mom died”, as the explanatory variable and H1TO7, which is “during the past 30 days, how many cigarettes did you smoke per day?”, as the response variable. The code used to produce the results is shown below:
Tumblr media
The results of running the code are shown in the following tables:
Tumblr media
As seen in the above table, the reported values for r and p are 0.17227 and 0.4677. The r value is fairly close to 0 (meaning a weak correlation), and with the high p-value this result is not statistically significant. Therefore we cannot reject the null hypothesis, and we have no evidence of a correlation between these two variables.
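For reference, a Pearson correlation between two quantitative variables needs very little code in SAS. This sketch assumes the data set is named `new` and that the non-response codes of both variables have already been set to missing, neither of which can be confirmed from the missing code image.

```sas
/* Prints the Pearson r and its p-value for each pair of listed variables */
proc corr data=new;
   var H1NM3 H1TO7;
run;
```

Only respondents with non-missing values on both variables contribute to the coefficient, so a question that applies to few respondents leaves a small sample.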
higgerz-blog · 8 years ago
Data Analysis Tools Assignment 2: Chi Square Test
Today’s assignment will be an analysis of the relationship between “feelings of depression” (H1FS6) and “attempted suicides” (H1SU2) from the Addhealth database using the Chi-Square test of independence.
The ATTEMPTS1 variable is created since our response variable otherwise has 6 categories. The attempted suicides are divided into two categories, people who have attempted 0 times (value of 0) and people who have attempted 1 or more times (value of 1). Code to produce the data is shown below. 
Tumblr media Tumblr media Tumblr media
The first of the last 7 code blocks is the Chi-Square test of independence between feelings of depression and our newly created variable ATTEMPTS1. The hypotheses to be tested are:
H0: There is no relationship between feelings of depression and suicide attempts
Ha: There is some relationship between feelings of depression and suicide attempts
The results for this test are shown below:
Tumblr media
The p-value of the Chi-Square test is <0.0001, which means that we can safely reject the null hypothesis and say that there IS a relationship between feelings of depression and suicide attempts. The column percentages with value 0 (did not attempt suicide) for ATTEMPTS1, going up in levels of depression, are 98.52%, 94.98%, 89.84%, and 82.81%, showing a near-linear decrease with increasing depression. For value 1 (attempted suicide) we see an increase from 1.48% to 5.02%, 10.16%, and 17.19% going from feelings of depression never or rarely to most of the time or all the time. All of the percentages seem to be different from this preliminary analysis. However, there are 4 categories for the explanatory variable:
Feelings of depression
0) never or rarely
1) sometimes
2) a lot of the time
3) most of the time or all of the time
We cannot say from this data alone that all the categories differ. To say this, we must test every pair with additional Chi-Square independence tests. There are 6 pairs in total (hence the last 6 blocks in my code). The p-value threshold for rejecting the null hypothesis (that is, for saying the two levels are in different categories) is adjusted for the number of comparisons. The new p-value needed for rejection of the null hypothesis is 0.05/6 = 0.00833. The result of one of these tests is shown below (showing all 6 would be excessive):
Tumblr media
Based on this, for levels “never or rarely” and “sometimes” we can safely say that they are in different categories. We continue this test 5 more times and acquire the following data.
The two levels being compared are in parentheses, followed by a dash and the p-value.
(never or rarely)(sometimes) - <0.0001
(never or rarely)(a lot of the time) - <0.0001
(never or rarely)(most of the time or all the time) - <0.0001
(sometimes)(a lot of the time) - <0.0001
(sometimes)(most of the time or all the time) - <0.0001
(a lot of the time)(most of the time or all the time) - 0.0132
Based on these, we can say the levels are all in separate categories with regard to suicide attempts, except for “a lot of the time” and “most of the time or all the time”, whose p-value of 0.0132 is above the adjusted threshold of 0.00833.
This leads to 3 categories: “never or rarely” in category A, “sometimes” in category B, and “a lot of the time” together with “most of the time or all the time” in category C.
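The overall test and one of the six pairwise follow-ups can be sketched as below. The data set name `new` is taken from the DATA step style used elsewhere on this blog and is an assumption here, as is the use of a WHERE statement to pick out a pair of levels; the original code is only available as images.

```sas
/* Overall test: ATTEMPTS1 (rows) by the four levels of H1FS6 (columns) */
proc freq data=new;
   tables ATTEMPTS1*H1FS6 / chisq;
run;

/* One pairwise follow-up: "never or rarely" (0) vs "sometimes" (1).
   Each pairwise p-value is compared against 0.05/6 = 0.00833. */
proc freq data=new;
   where H1FS6 in (0, 1);
   tables ATTEMPTS1*H1FS6 / chisq;
run;
```

Changing the two values in the WHERE statement to (0, 2), (0, 3), (1, 2), (1, 3), and (2, 3) covers the remaining five comparisons.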
higgerz-blog · 8 years ago
Data Analysis Tools Assignment 1: ANOVA
The dataset I am using is the Addhealth data set, and the two variables I am specifically looking at are “feelings of depression” (H1FS6) and “number of cigarettes smoked per day in the last 30 days” (H1TO7). I will be analyzing the relationship between H1FS6 and H1TO7, with feelings of depression being our categorical explanatory variable and H1TO7 being the quantitative response variable.
The code used to produce my results is shown below. Refused, legitimate skip, and other unuseful entries in the data are set to missing. The labels for feelings of depression are shown in FORMAT how_often_one.
Tumblr media
The frequency tables for these variables are shown below for reference:
Tumblr media Tumblr media
The null hypothesis to be tested with ANOVA is as follows:
h0: The mean number of cigarettes smoked per day for the last 30 days is the same regardless of depression level.
And the alternative hypothesis which is:
ha: The mean number of cigarettes smoked per day in the last 30 days differs among varying levels of depression.
The results of the ANOVA test are shown below:
Tumblr media Tumblr media Tumblr media Tumblr media
We can see from the graphical and numerical data that the mean number of cigarettes smoked per day in the past 30 days does not differ much across levels of depression. The p-value of 0.054 supports this. Outlier observations, of which there are many, are shown as black dots marked by their observation number.
The p-value of 0.054 is just above the 0.05 threshold, so we cannot reject the null hypothesis. We cannot accept the alternative hypothesis given the result of this test, and we have no evidence that the mean number of cigarettes smoked per day in the last 30 days differs across the levels of depression.
Also, because the difference in means is not statistically significant, post-hoc analysis is not necessary for these variables.
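As a hedged sketch of the test described in this post, a one-way ANOVA with a categorical explanatory variable and a quantitative response can be run as follows. The data set name `new` is an assumption, and the format name `how_often_one` is taken from the text above and assumed to be defined already.

```sas
proc anova data=new;
   format H1FS6 how_often_one.;
   /* H1FS6 is the categorical explanatory variable */
   class H1FS6;
   /* H1TO7 is the quantitative response */
   model H1TO7 = H1FS6;
   /* Group means; a post-hoc option such as / duncan would only be
      added here if the overall F test were significant */
   means H1FS6;
run;
quit;
```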
higgerz-blog · 8 years ago
Data Analysis and Management Assignment 4
I am currently studying the effects of depression on suicidal thoughts/attempts. For the purpose of this lesson,  I will be focusing on feelings of depression and attempted suicide. The code that was used in this assignment is shown below.
LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;
/* specify what data set we want from the library */
DATA new; set mydata.addhealth_pds;
/* Label the data according to human readable values */
LABEL  H1FS6 = "Feelings of depression"
  H1SU1 = "Thoughts of suicide in past year"
  H1SU2 = "Suicide attempts in past year";
  /* Uncomment to look at individuals experiencing depression more than sometimes */
* IF H1FS6 = 2 OR H1FS6 = 3;
/* Uncomment to look at individuals only depressed most or all of the time */
* IF H1FS6 = 3;
/* Make unuseful responses into missing entries */
/* H1FS6 */
IF H1FS6 = 6 OR H1FS6 = 8 then H1FS6 = .;
/* H1SU1 */
IF H1SU1 = 6 OR H1SU1 = 8 OR H1SU1 = 9 then H1SU1 = .;
/* H1SU2 */
/* If there was a legitimate skip here this means that the individual did
not attempt suicide because they did not even think about it */
IF H1SU2 = 7 then H1SU2 = -1;
IF H1SU2 = 6 OR H1SU2 = 8 then H1SU2 = .;
/* We also must collapse attempted suicides into two response categories for meaningful
information. To do this, we turn 0 times attempted into 0, and 1 or more times into 1.
The code for this is shown below */
IF H1SU2 = 0 OR H1SU2 = 9 THEN ATTEMPTS1 = 0;
ELSE IF H1SU2 GE 1 AND H1SU2 LE 4 THEN ATTEMPTS1 = 1;
PROC FREQ; TABLES H1SU1 H1SU2 ATTEMPTS1;
/* sort the data by the unique identifier */
PROC SORT; by AID;
/* Label the variables shown in the tables */
PROC FORMAT;
VALUE how_often_one
0="never or rarely"
1="sometimes"
2="a lot of the time"
3="most of the time or all the time"
6="refused"
8="don't know";
PROC FORMAT;
VALUE how_often_two
0="no"
1="yes"
6="refused"
8="don't know"
9="not applicable";
PROC FORMAT;
VALUE number_of_times
0="0 times (C)"
1="1 times"
2="2 or 3 times"
3="4-5 times"
4="6 or more times"
6="refused"
7="legitimate skip"
8="don't know"
-1="0 times (NC)";
/* As an abbreviation for ease of plotting, C for 0 times means that they contemplated suicide
and NC means that they did not */
/* Get the frequencies of the specified data */
/* PROC FREQ;
FORMAT H1FS6 how_often_one. H1SU1 how_often_two. H1SU2 number_of_times.;
TABLES H1FS6 H1SU1 H1SU2; */
/* This is the code for making the vertical C-C bar chart between feelings of depression
and thoughts of suicide */
/* Independent variable is feelings of depression, and dependent variable for causal model
is thoughts of suicide */
PROC GCHART;
FORMAT H1FS6 how_often_one. H1SU1 how_often_two. H1SU2 number_of_times.;
VBAR H1FS6/Discrete type = PCT width=20;
VBAR H1SU1/Discrete type = PCT width=20;
VBAR H1SU2/Discrete type = PCT width=10;
VBAR H1FS6/Discrete type = MEAN SUMVAR=H1SU1 width=20;
VBAR H1FS6/Discrete type = MEAN SUMVAR=ATTEMPTS1 width=20;
/* Run the program */
RUN;
First let’s look at the percentages graphically for feelings of depression and number of attempted suicide.
Tumblr media
As we can see, the vast majority of the population studied has no feelings of depression, making this a right-skewed distribution with the percentage decreasing as we move to higher frequencies of depression. About 60% experience depression never or rarely and 30% experience it sometimes, with small percentages above this.
Tumblr media
The figure above shows the attempted suicide percentages. 0 times (NC) means they attempted 0 times and never contemplated suicide (hence the NC), and 0 times (C) means that they attempted 0 times but did contemplate suicide. About 87.5% of the individuals asked did not ever contemplate or attempt suicide, 9% thought about it but never went through with it, and the remaining 3.5% or so attempted suicide 1 or more times. Again, this is a highly right-skewed distribution toward the lack of attempting suicide (as we would expect).
In order to get more meaningful information out of this data, we will use a Categorical to Categorical bar plot. The causal model we propose is that depression is linked to attempted suicide. Thus depression is our explanatory variable and attempted suicide is our response variable. However, our attempted suicide variable has too many responses, and we must collapse it to be binary. In order to look specifically at people who thought of suicide, we will exclude the “0 times (NC)” responses from a new variable ATTEMPTS1. ATTEMPTS1 is defined such that 0 times has a response of 0, and 1 time or more has a response of 1. This will tell us which of the people had 1 or more suicide attempts and which did not attempt at all. The collapse of our original variable H1SU2 into ATTEMPTS1 is shown in the following two frequency tables.
Tumblr media Tumblr media
As we can see, this focuses on a highly specific group of interest, about which we are trying to draw a conclusion. Are feelings of depression correlated with suicide attempts? The C-C plot of these variables is shown below:
Tumblr media
As we can see from the above figure, the percentage of people who attempted 1 or more times is actually fairly high even for the “never or rarely” answer. This slowly increases as we get to the “most of the time or all the time” category, which has a mean of 38%. More statistical analysis must be done, and it is hard to draw a conclusion with this relatively small population, but the right-increasing nature of this plot suggests that there is an association between depression and attempted suicides.
higgerz-blog · 8 years ago
Data Management and Visualization Assignment 3
Shown below is my code for assignment 3. 3 variables were studied and were relabeled for ease of the reader. 
Tumblr media Tumblr media
The frequency tables generated by this code are shown below, and the handling of missing data is explained.
My first table shows the frequency of depression among the adolescents. There were two answers to this question that were not useful, which were “refused” and “don’t know”. 8 refused and 12 did not know. These two responses were set to missing, bumping the other percentages up slightly.
Tumblr media
The second table shows how often the person thinks about suicide. 46 refused, 22 said “don’t know” and one said “not applicable”. These are similarly not useful responses, as they do not give us any information about the context of the question, and are thus set to missing.
Tumblr media
For suicide attempts in the past year, we have a different problem: the legitimate skip. If the person answers that they never thought of suicide, the question about attempted suicide is skipped. As a result, a very large share of the population (5,683 of the 6,504 asked) has a legitimate skip. This is different from the answer “0 times”, which was given by people who did have thoughts of suicide and therefore were asked the question. So this “skip” actually contains more information, specifically that they did not think about suicide AND did not attempt suicide. We therefore create a separate category for it in our data, “0 times and did not contemplate.” To differentiate, the “0 times” category is now labeled “0 times and did contemplate”, which makes the data more meaningful. The categories of 1 time and higher obviously imply that the person contemplated suicide. As with the other questions, there were a few non-useful responses: 3 refused to answer and 1 did not know, and these were set to missing.
Tumblr media
This should make working with the data easier, and the conclusions from future data analysis clearer.
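The recoding described above can be sketched roughly as follows, in Python rather than the SAS shown in the images. The numeric codes (6 for refused, 7 for legitimate skip, 8 for don't know, 0 to 4 for the attempt categories) are my assumption for illustration and should be checked against the codebook page for H1SU2; only the category wording and the 5,683 of 6,504 count come from the text above.

```python
# Assumed numeric codes for H1SU2; verify against the codebook before use.
REFUSED, LEGITIMATE_SKIP, DONT_KNOW = 6, 7, 8

ATTEMPT_LABELS = {
    0: "0 times and did contemplate",
    1: "1 time",
    2: "2 or 3 times",
    3: "4 or 5 times",
    4: "6 or more times",
    # The legitimate skip carries information, so it gets its own category.
    LEGITIMATE_SKIP: "0 times and did not contemplate",
}


def recode_attempts(code):
    """Map a raw code to a label, or to None (missing) for unusable answers."""
    if code in (REFUSED, DONT_KNOW):
        return None
    # Any code not listed above is also treated as missing.
    return ATTEMPT_LABELS.get(code)


# Share of respondents with a legitimate skip, using the counts quoted above.
legit_skip_share = round(100 * 5683 / 6504, 1)
```

Keeping the legitimate skip as a real category, instead of setting it to missing like “refused” and “don't know”, is what lets the later tables separate people who never contemplated suicide from those who contemplated it but made no attempt.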
higgerz-blog · 8 years ago
Data Management and Visualization Assignment 2
Variables in the code were relabeled in order to make the output more readable. Two different IF statements were tried but are commented out so that the full data tables are produced.
Tumblr media
The three output tables generated by this program are shown below.
Tumblr media
The question that I considered in the last assignment was “is depression associated with suicide?” The two variables from the Add Health codebook that I decided to focus on in connection with this are thoughts of suicide and attempts at suicide. The literature search also indicated that there was a correlation between depression and suicide, but some of the sources disagreed.
The first table shows the answers to the question “How often have you felt depressed?” 61.4% said they never have, with 28.5% saying sometimes, 6.8% saying a lot, and about 3% saying most of or all of the time. Out of the 6,504 people who responded, 8 refused and 12 said they didn't know. What is striking to me is that about 38% of these adolescents said that they have felt depressed sometimes or more.
The second table shows the answers to the question “During the past 12 months, how often have you thought of suicide?” A whopping 86.32% responded no, with only 12.6% responding yes. 22 reported they did not know and 1 reported “not applicable”. I am not exactly sure what this last answer means, and it is not explained in the codebook, but it is such a small percentage of the data that it should not affect the results.
The third table shows the responses to “In the past 12 months, how many times have you attempted suicide?” The largest percentage, about 87.4%, answered no to the thoughts of suicide question and thus had a legitimate skip on this question (this is the “did not think about” category). This large percentage will be ignored in further data analysis. Of the remaining 12.6%, 9% did not attempt at all, 2% attempted once, and 1% attempted 2-3 times, with extremely small percentages for higher numbers of attempts. 3 refused to answer and 1 didn't know.
Out of these categories, I decided I wanted to focus on the individuals who felt depressed a lot of the time, most of the time or all of the time and see the percentages that followed in the other two questions. Out of the 193 people experiencing depression “most of the time or all of the time”, 45% had thoughts of suicide in the past year, 6% had attempted once, 6% attempted 2-3 times, 1% attempted 4-5 times, and 3.6% attempted 6 or more times. The 6 or more times statistic is one I would really like to look into in future assignments, because it is perplexing that it is so high.
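Looking at one depression group and then tabulating the other answers within it amounts to a subset followed by a frequency table. Below is a rough Python sketch of that idea (the assignment itself used SAS). The function names and the toy rows are invented for illustration and are not the Add Health responses; missing answers are represented as None and left out of the percentages.

```python
from collections import Counter


def percent_table(values):
    """Percent of non-missing responses per category (missing is None)."""
    valid = [v for v in values if v is not None]
    counts = Counter(valid)
    return {category: 100.0 * n / len(valid) for category, n in counts.items()}


def attempts_within_group(rows, depression_level):
    """Frequency table of attempts restricted to one depression category."""
    subset = [attempts for depression, attempts in rows if depression == depression_level]
    return percent_table(subset)


# Made-up toy rows: (depression answer, attempts answer).
toy_rows = [
    ("most or all of the time", "0 times"),
    ("most or all of the time", "0 times"),
    ("most or all of the time", "1 time"),
    ("most or all of the time", "2 or 3 times"),
    ("most or all of the time", None),
    ("never", "0 times"),
]
table = attempts_within_group(toy_rows, "most or all of the time")
```

Note that the percentages are taken over the non-missing answers in the subset only, so dropping “refused” and “don't know” responses raises the remaining percentages slightly.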
higgerz-blog · 8 years ago
Data Management and Visualization Assignment 1
Topic Selection:
I have always been interested in how people grow and change throughout their lives, and thus I chose the Add Health codebook to look at a large part of the early developmental stage.
I had a few friends growing up who suffered from depression, and I have always been curious about how it affects people, especially people who chose to take their own life. So I have chosen the Suicide section of the Add Health codebook. More specifically, I am interested in what factors lead to someone's attempt at suicide, or variable H1SU2.
After looking further into the Add Health codebook, I am curious whether the number of attempted suicides is correlated with feeling depressed, or variable H1FS6. Both pages involving these variables have been added to my personal codebook. My research question involving these two is “Is depression associated with suicide?”
I have done a search of the literature involving these terms in Google Scholar. The exact search terms used were “association between depression and suicide” and “depression causes suicide”.
Literature Review:
A study by Lisa Crona and co-authors in BMC Psychiatry examined the long-term effects of a suicide attempt on patients. The population was people with severe depression, 42-56 years after their suicide attempt. They found that depression was a major factor in thoughts of suicide and actual suicide attempts. Many in the population described the events leading up to their suicide attempt as “being trapped in an overwhelming situation.” It was also found that recovering from being suicidal happened regardless of recovery from the depression.[1]
An all-women study from Thornton and coworkers examined the effects of environmental and genetic factors on suicide attempts. Their study considers these factors in relation to major depressive disorder (MDD). They cite that 30-90% of all people who die from suicide were suffering at one point from MDD. Interestingly, they find that genetic factors account for the majority of the prevalence of MDD and thus of suicide attempts. Unique environmental factors also play a large role in the development of the disorder.[2]
Turecki and Brent conducted a more general study analyzing suicide and suicidal behaviors among a wide variety of cultures, ages, sexes and locations. They also support the claim that depression and suicide are associated with one another, especially in the elderly. They also give an exhaustive list of factors that affect the risk of suicide, in which hopelessness caused by depression is present.[3]
Zhang and Li looked at the association between depression and suicide while controlling for hopelessness. The population was informants for suicide victims and members of the Chinese rural population without any sort of depressive disorder. They find that when hopelessness is controlled for, depression and suicide are not associated. They claim that the feeling of despair from hopelessness is much more correlated with the risk of suicide than depression is.[4]
All of these articles at least acknowledge the importance of depression as a factor to consider in whether or not an individual attempts suicide. Depression can cause other factors that are related to attempted suicide, but depression seems to always be at least somewhat involved.
Hypothesis:
Some studies find depression to be a strong factor in attempted suicide, while others find that different factors are much more dominant, but none were performed specifically among an adolescent population. Still, depression is cited many times as an important factor in general, and most studies do not ignore it as a possibility. Based on the above literature review, my hypothesis is that depression is a major factor in attempted suicide in teens.
References:
[1] Crona, Lisa, et al. "Taking care of oneself by regaining control - a key to continue living four to five decades after a suicide attempt in severe depression." BMC Psychiatry 17.1 (2017): 69.
[2] Thornton, Laura M., et al. "Anorexia nervosa, major depression, and suicide attempts: shared genetic factors." Suicide and Life-Threatening Behavior (2016).
[3] Turecki, Gustavo, and David A. Brent. "Suicide and suicidal behaviour." The Lancet 387.10024 (2016): 1227-1239.
[4] Zhang, Jie, and Ziyao Li. "The association between depression and suicide when hopelessness is controlled for." Comprehensive Psychiatry 54.7 (2013): 790-796.