CAPSTONE PROJECT: Preliminary Statistical Findings
Descriptive Statistics:
Table 1 below shows descriptive statistics for the categorical analytic variables of interest for the sample population. Fifty-nine percent of the sample was involved in at least one recidivism event during the nine-month follow-up period.
Bivariate Analysis
Bar graphs of the associations between the recidivism response variables and the CBM treatment predictor (Figure 1) revealed that all forms of recidivism occurred at a lower rate when CBM was undertaken. Chi-square analysis, used to test the association between treatment and recidivism rates, however, indicates that none of the associations were significant (see Table 2).
Figure 1: Association between CBM and rates of recidivism during 9-month follow-up period.
Table 2: Bivariate Analysis
Multivariate Analysis – Lasso Regression Analysis.
Lasso regression analysis was used to identify the set of variables that most strongly predict recidivism among parolees. Thirty-three predictor variables associated with gender, number of minor children, education, past drug use, arrest history, and means of financial support were considered. Table 3 shows the 12 variables that were retained in the model selected by the lasso regression analysis. Number of minor children; unemployment-based financial support; and other opiate, hallucinogen, and amphetamine use were associated with recidivism. Number of minor children, use of other opiates and hallucinogens, and financial support from a partner/friend were positively associated with recidivism, whereas unemployment-based financial support and amphetamine use were negatively associated.
Table 3: Lasso Least Angle Regression Selection Summary
Together, these 12 predictors accounted for only 12.8% (R-square = .1279) of the variance in recidivism, suggesting the model has very weak predictive value. Additionally, Figure 4 below shows that the predictive accuracy of the lasso regression algorithm developed on the training data set (MSE = .25132) differed noticeably from that on the test data set (MSE = .21257). Of note, CBM treatment was identified as only the 20th variable in order of importance.
Challenges Faced in Undertaking Preliminary Analysis:
My original intention was to analyze the effectiveness of CBM treatment for a subset of the population - women with minor children. This subset of the sample, however, is not large enough to carry out the analysis, so I will be rewriting my research question to reflect a more general interest in understanding recidivism across the whole sample.
I would also like to include race variables in my lasso regression analysis. The race variable data needed cleaning (missing data had not been recorded with a "0" value), so initial attempts at running the analysis resulted in error statements. After re-coding the data and attempting to run LAR, another error message appeared: "Selection aborted because there are no suitable observations for training." At this point, I have not been able to solve this issue. In addition, I am uncertain as to the significance of the following statement that was printed under the LAR selection summary table: "Selection stopped because all candidate effects for entry are linearly dependent on effects in the model."
CAPSTONE PROJECT: METHODS
Sample:
The sample included N = 450 drug-involved parolees who were randomly assigned to participate in CBM or the usual parole system. The study was conducted from 2004 to 2008. The population from which the sample was drawn consisted of parolees who were at least 18 years of age, English speaking, and, per drug screen results, likely drug dependent immediately prior to incarceration. Additional requirements for inclusion were: drug treatment recommended or mandated as part of parole conditions, and identification as at moderate to high risk of recidivism on the basis of the Lifestyle Criminality Screening Form (LCSF) or prior convictions. Persons with psychotic symptoms were not included in the study.
Measures:
The primary analysis examines whether the CBM intervention decreases recidivism in comparison to standard parole. Recidivism was measured by any incidence of arrest, illegal drug use, or commission of a crime during the nine-month follow-up period. The secondary analysis considers whether there are moderating factors that influence effectiveness, such as female gender and presence of minor children, history of drug use, previous criminal record, and major form of economic support, among others. Data on predictors were collected in intake interviews. Data on the recidivism response variables were collected through follow-up interviews with parolees, data collected at/by parole offices, and toxicology results from urine tests.
Analyses:
Initial primary analysis was conducted by calculating frequencies for participation in CBM or standard parole (the categorical predictor variable) and the three categorical response variables of recidivism. A chi-square test of independence was used to test the bivariate association between parolee treatment (CBM or standard parole) and the three types of recidivism (the response variables).
Initial secondary analysis was conducted by calculating frequencies for participation in CBM or standard parole and the three categorical response variables of recidivism only for women with minor children. Similarly, a chi-square test of independence was used to test the bivariate association between parolee treatment (CBM or standard parole) and the three types of recidivism only for women with minor children.
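To illustrate this step (this is not the actual capstone code, and the file and column names below are placeholders), a chi-square test of independence on a treatment-by-recidivism cross-tabulation could be run in Python as follows:

import pandas as pd
import scipy.stats

# placeholder file and column names, for illustration only
parolees = pd.read_csv("stepnout.csv")
ct = pd.crosstab(parolees["treatment"], parolees["rearrest"])   # treatment-by-recidivism contingency table
chi2, p, dof, expected = scipy.stats.chi2_contingency(ct)
print(ct)
print("chi-square = %.3f, p = %.4f" % (chi2, p))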
A linear regression was used to test the association between parolee treatment and the three forms of recidivism for the whole sample and for the subsample of women with minor children. Bar graphs are used to depict the relationship between frequency of recidivism and parolee treatment.
To further understand the importance of a series of predictor variables (gender, history of drug use, previous criminal record, major form of economic support, race, among others) in evaluating the potential effectiveness of CBM, a lasso regression analysis using k-fold cross-validation was performed. Data were randomly split into a training set that included 70% of the observations and a test set that included 30% of the observations. The least angle regression algorithm with k=10 fold cross-validation was used to estimate the lasso regression model in the training set, and the model was validated using the test set. The change in the cross-validation average (mean) squared error at each step was used to identify the best subset of predictor variables. All predictor variables were categorical. A graph of the coefficient progression was included to depict the relative importance of the various predictors and the direction of association.
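A minimal Python sketch of the lasso procedure described above (the file and variable names are placeholders, not the actual Step'n Out variables):

import pandas as pd
from sklearn.cross_validation import train_test_split   # sklearn.model_selection in newer versions
from sklearn.linear_model import LassoLarsCV

# placeholder file and column names, for illustration only
parolees = pd.read_csv("stepnout.csv")
predictors = parolees.drop("recidivism", axis=1)
target = parolees["recidivism"]

# 70% training / 30% test split
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.3, random_state=123)

# lasso estimated with the least angle regression algorithm and k=10 fold cross-validation
model = LassoLarsCV(cv=10, precompute=False).fit(pred_train, tar_train)
print(dict(zip(predictors.columns, model.coef_)))   # retained predictors have non-zero coefficients
print(model.score(pred_test, tar_test))             # R-square on the test (validation) set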
Introducing the research question.
Evaluating the Effectiveness of Collaborative Management Programs for Reducing Recidivism Among Drug-Involved Female Parolees with Minor Children.
Research Introduction: Substance use is widespread among persons convicted of crimes, prisoners, and parolees. Prisoners and parolees do not normally receive drug abuse treatment in prison or upon release, and recidivism among substance users is fairly high. To reduce recidivism, some jurisdictions have developed addiction treatment programs: the programs utilize various models of treatment and typically bring together parole officers and treatment counselors to collaboratively introduce, offer, and motivate parolees to engage in drug treatment.
Focus Data Set: My research focuses on data collected by the Step'n Out study, a multi-site randomized behavioral trial conducted from 2004 to 2008. The Step'n Out study involved 450 drug-involved parolees who were randomly assigned to collaborative behavioral management or usual parole. Parolees were followed at 3 months and 9 months to assess whether they had been rearrested and the status of their drug use. Step'n Out data collection efforts included: 1) intake interviews, which covered sociodemographic background, family and peer relations, health and psychological status, criminal involvement, drug use history, and HIV/AIDS risk behaviors; and 2) follow-up interviews specific to measuring recidivism to crime and drug relapse, which involved self-report and toxicology results from urine tests. Data were also collected from parole officers and treatment counselors to understand the character, structure, and process of the collaboration that constitutes the therapeutic alliance.
Research Question and Rationale: The purpose of the analysis is twofold: first, to analyze the comparative effectiveness of collaborative behavioral management (CBM) for reducing drug use relapse and recidivism; and second, to examine the influence of moderating factors, such as participant characteristics and type of drug use. In particular, I am interested in understanding the comparative effectiveness of the program for women with minor children who have been involved in hard drugs. Other (hidden) variables of interest are: homeless status, educational status, employment status, and form of economic support. If CBM programs are effective, their wider adoption could improve outcomes for offenders and families, and also strengthen collaborations between the criminal justice and addiction treatment systems.
References: Friedmann, P. D., Katz, E. C., Rhodes, A. G., Taxman, F. S., O'Connell, D. J., Frisman, L. K., Burdon, W. M., Fletcher, B. W., Litt, M. D., Clarke, J., & Martin, S. S. (2008). Collaborative behavioral management for drug-involved parolees: Rationale and design of the Step'n Out study. Journal of Offender Rehabilitation, 47(3), 290-318. DOI: 10.1080/10509670802134184
K Means Cluster Analysis
A k-means cluster analysis was conducted to identify subgroups of countries based on their similarity of responses on 5 variables that represent characteristics that could have an impact on breast cancer incidence. Clustering variables in this case are all quantitative, measuring the consumption of alcohol, CO2 emissions, female employment rate, income per person and life expectancy.
All clustering variables were standardized to have a mean of 0 and a standard deviation of 1.
Data were not split into training and test set since there is a relatively small number of observations (214 countries).
A series of k-means cluster analyses were conducted on the data specifying k=1-9 clusters, using Euclidean distance. The average distance of the observations from their cluster centroids was plotted for each of the nine cluster solutions in an elbow curve to provide guidance for choosing the number of clusters to interpret.
Here is my Python code:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from sklearn.cross_validation import train_test_split
from sklearn import preprocessing
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist

#loading dataset
data = pd.read_csv("gapminder.csv")

#setting variables you will be working with to numeric because python reads them from the csv file as strings (objects)
def analyse(ax):
    global data
    data[ax] = pd.to_numeric(data[ax], errors='coerce')

analyse("breastcancerper100th")
analyse("alcconsumption")
analyse("co2emissions")
analyse("femaleemployrate")
analyse("incomeperperson")
analyse("lifeexpectancy")

#data management
data_clean = data.dropna()

#Subset the 5 clustering variables
cluster = data_clean[["alcconsumption", "co2emissions", "femaleemployrate", "incomeperperson", "lifeexpectancy"]]
cluster.describe()

#standardize clustering variables to have mean=0 and sd=1
clustervar = cluster.copy()

def standardizing(x):
    clustervar[x] = preprocessing.scale(clustervar[x].astype("float64"))

standardizing("alcconsumption")
standardizing("co2emissions")
standardizing("femaleemployrate")
standardizing("incomeperperson")
standardizing("lifeexpectancy")

# k-means cluster analysis for 1-9 clusters
clusters = range(1, 10)
meandist = []

for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(clustervar)
    clusassign = model.predict(clustervar)
    # average minimum distance of the observations from their cluster centroid
    meandist.append(sum(np.min(cdist(clustervar, model.cluster_centers_, "euclidean"), axis=1)) / clustervar.shape[0])

#Plot the average distance from observations to the cluster centroids to use the Elbow Method to identify the number of clusters to choose
plt.plot(clusters, meandist)
plt.xlabel("Number of clusters")
plt.ylabel("Average distance")
plt.title("Selecting k with the Elbow Method")
plt.show()
The figure below shows the elbow curve of average distance values for the nine cluster solutions.
This plot shows the decrease in the average minimum distance of the observations from the cluster centroids for each of the cluster solutions. We can see that the average distance decreases as the number of clusters increases. Since the goal of cluster analysis is to minimize the distance between observations and their assigned clusters, we want to choose the fewest number of clusters that provides a low average distance. What we're looking for in this plot is a bend in the curve that shows where the average distance value might be leveling off, such that adding more clusters doesn't decrease the average distance by much.
We can see that there appear to be a couple of bends in the line, at two clusters and at three clusters, but it's not very clear. To help figure out which solution is best, we should further examine at least the two- and three-cluster solutions to see whether the clusters are distinct (do not overlap), whether the patterns of means on the clustering variables are unique and meaningful, and whether there are significant differences between the clusters on our external validation variable, "breastcancerper100th".
Here is my additional code for the three-cluster solution:
#interpret 3 cluster solution
from sklearn.decomposition import PCA

model3 = KMeans(n_clusters=3)
model3.fit(clustervar)
clusassign = model3.predict(clustervar)

#plot clusters: reduce the clustering variables to 2 canonical variables for plotting
pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(clustervar)
plt.scatter(x=plot_columns[:,0], y=plot_columns[:,1], c=model3.labels_)
plt.xlabel('Canonical variable 1')
plt.ylabel('Canonical variable 2')
plt.title('Scatterplot of Canonical Variables for 3 Clusters')
plt.show()

#multiple steps to merge the cluster assignments with the clustering variables to examine cluster variable means by cluster
# create a unique identifier variable from the index for the cluster training data to merge with the cluster assignment variable
clustervar.reset_index(level=0, inplace=True)
# create a list that has the new index variable
cluslist = list(clustervar['index'])
# create a list of cluster assignments
labels = list(model3.labels_)
# combine index variable list with cluster assignment list into a dictionary
newlist = dict(zip(cluslist, labels))
# convert newlist dictionary to a dataframe
newclus = DataFrame.from_dict(newlist, orient='index')
# rename the cluster assignment column
newclus.columns = ['cluster']

# now do the same for the cluster assignment variable
# create a unique identifier variable from the index for the cluster assignment dataframe to merge with the cluster training data
newclus.reset_index(level=0, inplace=True)
# merge the cluster assignment dataframe with the cluster training variable dataframe by the index variable
merged_train = pd.merge(clustervar, newclus, on='index')
merged_train.head(n=100)
# cluster frequencies
merged_train.cluster.value_counts()

# calculate clustering variable means by cluster
clustergrp = merged_train.groupby('cluster').mean()
print("Clustering variable means by cluster")
print(clustergrp)

# validate clusters by examining cluster differences in breastcancerper100th using ANOVA
# first merge the breast cancer variable with the clustering variables and cluster assignment data
bc_data = data_clean["breastcancerper100th"]
# split the breast cancer data into train and test sets
bc_train, bc_test = train_test_split(bc_data, test_size=.3, random_state=123)
bc_train1 = pd.DataFrame(bc_train)
bc_train1.reset_index(level=0, inplace=True)
merged_train_all = pd.merge(bc_train1, merged_train, on='index')
sub1 = merged_train_all[['breastcancerper100th', 'cluster']].dropna()

import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

bcmod = smf.ols(formula='breastcancerper100th ~ C(cluster)', data=sub1).fit()
print(bcmod.summary())

print('means for breast cancer by cluster')
m1 = sub1.groupby('cluster').mean()
print(m1)

print('standard deviations for breast cancer by cluster')
m2 = sub1.groupby('cluster').std()
print(m2)

mc1 = multi.MultiComparison(sub1["breastcancerper100th"], sub1['cluster'])
res1 = mc1.tukeyhsd()
print(res1.summary())
Results:
Canonical discriminant analysis was used to reduce the 5 clustering variables down to a few variables that accounted for most of the variance in the clustering variables. A plot of the first two canonical variables for the clustering variables by cluster is shown below:

This scatterplot indicates that the clusters are not densely packed and that they do not overlap very much with each other, which suggests that the observations within the clusters are not highly correlated and that there is high variance between clusters.
Analysis:
The means on the clustering variables showed that, compared to the other clusters, the second cluster (cluster 1) clearly includes countries more likely to have a high incidence of breast cancer, with high levels of alcohol consumption and high life expectancy.

In order to externally validate the clusters, an analysis of variance (ANOVA) was conducted to test for significant differences between the clusters on breast cancer incidence. The results showed that the clusters do differ significantly (p-value = 0.00254).

The Tukey post hoc comparisons showed significant differences between clusters on breast cancer incidence, with the exception that clusters 0 and 1 were not significantly different from each other. Countries in cluster 2 had the highest breast cancer incidence (mean=43.94, sd=22.64), and cluster 3 had the lowest (mean=23.06, sd=16.84).
Let’s see how Lasso Regression performed with the Gapminder dataset.
Lasso regression analysis is a shrinkage and variable selection method for linear regression models. The goal of lasso regression is to obtain the subset of predictors that minimizes prediction error for a quantitative response variable. The lasso does this by imposing a constraint on the model parameters that causes the regression coefficients for some variables to shrink toward zero. Variables with a regression coefficient equal to zero after the shrinkage process are excluded from the model, while variables with non-zero regression coefficients are those most strongly associated with the response variable. Explanatory variables can be quantitative, categorical, or both.
Here is my Python code:
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LassoLarsCV
from sklearn import preprocessing
from sklearn.metrics import mean_squared_error

#loading the dataset
data = pd.read_csv("gapminder.csv")

#setting variables you will be working with to numeric because python reads them from the csv file as strings (objects)
def analyse(ax):
    global data
    data[ax] = pd.to_numeric(data[ax], errors='coerce')

analyse("breastcancerper100th")
analyse("alcconsumption")
analyse("armedforcesrate")
analyse("co2emissions")
analyse("femaleemployrate")
analyse("hivrate")
analyse("lifeexpectancy")
analyse("polityscore")
analyse("suicideper100th")
analyse("employrate")
analyse("urbanrate")

#data management
data_clean = data.dropna()

#set my explanatory variables (predictors) and response variable (target)
predvar = data_clean[["alcconsumption", "armedforcesrate", "co2emissions", "femaleemployrate",
                      "hivrate", "lifeexpectancy", "polityscore", "suicideper100th",
                      "employrate", "urbanrate"]]

#standardize predictors to have mean=0 and sd=1
predictors = predvar.copy()

def standardizing(x):
    global predictors
    predictors[x] = preprocessing.scale(predictors[x].astype("float64"))

standardizing("alcconsumption")
standardizing("armedforcesrate")
standardizing("co2emissions")
standardizing("femaleemployrate")
standardizing("hivrate")
standardizing("lifeexpectancy")
standardizing("polityscore")
standardizing("suicideper100th")
standardizing("employrate")
standardizing("urbanrate")

target = data_clean["breastcancerper100th"]
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.3, random_state=123)

#specify the lasso regression model, estimated with 10-fold cross-validation on the training data
model = LassoLarsCV(cv=10, precompute=False).fit(pred_train, tar_train)

#print the regression coefficient for each predictor
my_dict = dict(zip(predictors.columns, model.coef_))
for k, v in my_dict.items():
    print(k, v)

#plot coefficient progression
m_log_alphas = -np.log10(model.alphas_)
ax = plt.gca()
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-np.log10(model.alpha_), linestyle="--", color="k", label="alpha CV")
plt.ylabel("Regression Coefficients")
plt.xlabel("-log(alpha)")
plt.title("Regression Coefficients Progression for Lasso Paths")
plt.show()

# plot mean square error for each fold
m_log_alphascv = -np.log10(model.cv_alphas_)
plt.figure()
plt.plot(m_log_alphascv, model.cv_mse_path_, ':')
plt.plot(m_log_alphascv, model.cv_mse_path_.mean(axis=-1), 'k', label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha CV')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean squared error')
plt.title('Mean squared error on each fold')
plt.show()

# MSE from training and test data
train_error = mean_squared_error(tar_train, model.predict(pred_train))
test_error = mean_squared_error(tar_test, model.predict(pred_test))
print('\ntraining data MSE')
print(train_error)
print('\ntest data MSE')
print(test_error)

# R-square from training and test data
rsquared_train = model.score(pred_train, tar_train)
rsquared_test = model.score(pred_test, tar_test)
print('\ntraining data R-square')
print(rsquared_train)
print('\ntest data R-square\n')
print(rsquared_test)
The output:
Analysis:
Predictors with regression coefficients that do not have a value of zero are included in the selected model. A regression coefficient equal to zero means that the coefficient for that variable shrank to zero after applying the lasso regression penalty, and the variable was subsequently removed from the model.

The Python console results show that, of the 10 variables, 5 were selected in the final model.

We can also use the size of the regression coefficients to tell us which predictors are the strongest predictors of breast cancer incidence. For example, urban rate and alcohol consumption had the largest regression coefficients and were most strongly associated with breast cancer incidence, followed by life expectancy and CO2 emissions.
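As a small sketch of that ranking step (it assumes the my_dict coefficient dictionary created in the code above), the predictors can be sorted by the absolute size of their standardized coefficients:

# rank predictors by the absolute size of their standardized lasso coefficients
ranked = sorted(my_dict.items(), key=lambda kv: abs(kv[1]), reverse=True)
for name, coef in ranked:
    if coef != 0:
        print(name, round(coef, 3))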
From the first plot we can see the relative importance of the predictor selected at any step of the selection process, how the regression coefficients changed with the addition of a new predictor at each step, as well as the steps at which each variable entered the model.
The second plot shows the change in the mean square error for the change in the penalty parameter alpha at each step in the selection process.
The R-square values were 0.65 and 0.51, indicating that the selected model explained 65% and 51% of the variance in breast cancer incidence for the training and test sets, respectively.
Random Forest analysis
Random forest analysis was performed in Python 3 to evaluate the importance of a series of explanatory variables in predicting a binary, categorical response variable. Random forests provide importance scores for each explanatory variable and also allow us to evaluate any increase in correct classification as smaller or larger numbers of trees are grown.

The following explanatory variables were included as possible contributors to a random forest evaluating my response variable, breast cancer incidence (moderate/strong): income per person, alcohol consumption, CO2 emissions, female employment rate, life expectancy, and urban rate.
Here is my Python code:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics
# Feature Importance
from sklearn import datasets
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier
#to indicate where my data set is located
os.chdir("/home/genara/Documents/Coursera/4_Machine_Learning/Random Forests")
#loading the dataset
AH_data = pd.read_csv("gapminder.csv")

def analyse(ax):
    #setting variables you will be working with to numeric because python reads them from the csv file as strings (objects)
    AH_data[ax] = pd.to_numeric(AH_data[ax], errors='coerce')

analyse("breastcancerper100th")
analyse("incomeperperson")
analyse("alcconsumption")
analyse("co2emissions")
analyse("femaleemployrate")
analyse("lifeexpectancy")
analyse("urbanrate")

#grouping breast cancer (response variable) into 2 categories
AH_data["breastcategories"] = pd.cut(AH_data["breastcancerper100th"], bins=[1, 40, 102], labels=False)
labels = np.array("0_Moderate 1_Strong".split())
AH_data["breastcategories"] = pd.to_numeric(AH_data["breastcategories"], errors='coerce')
#Because decision tree analyses cannot handle any NA’s in our data set, my next step is to create a clean data frame that drops all NA’s.
data_clean = AH_data.dropna()
#I set my explanatory (predictors) and response variable (target).
predictors = data_clean[["incomeperperson", "alcconsumption", "co2emissions", "femaleemployrate", "lifeexpectancy", "urbanrate"]]
targets = data_clean.breastcategories
#And then include the train test split function for predictors and target. And set the size ratio to 60% for the training sample and 40% for the test sample by indicating test_size=.4.
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)
print("The shape of the predictor training sample:")
print(pred_train.shape)
print("The shape of the predictor test sample:")
print(pred_test.shape)
print("The shape of the target training sample:")
print(tar_train.shape)
print("The shape of the target test sample:")
print(tar_test.shape)
#Build model on training data
#n_estimators are the number of trees you would build with the random forest algorithm.
classifier = RandomForestClassifier(n_estimators = 25)
#Then use this classifier.fit function which you pass the training predictors and training targets to. It’s this code that will fit our model.
classifier=classifier.fit(pred_train,tar_train)
#Next we include the predict function where we predict for the test values
predictions=classifier.predict(pred_test)
#and then call in the confusion matrix function which we passed the target test sample to.
print("Confusion Matrix: ")
print(sklearn.metrics.confusion_matrix(tar_test,predictions))
#We can also look at the accuracy score
print("Accuracy Score: ")
print(sklearn.metrics.accuracy_score(tar_test, predictions))
#fit an Extra Trees model to the data
model = ExtraTreesClassifier()
model.fit(pred_train, tar_train)
# display the relative importance of each attribute
print(model.feature_importances_)
#Impact of tree size on prediction accuracy
#To determine what growing larger number of trees has brought us in terms of correct classification. We’re going to use code that builds for us different numbers of trees, from one to 25, and provides the correct classification rate for each.
trees = range(25)
accuracy = np.zeros(25)
for idx in range(len(trees)):
    classifier = RandomForestClassifier(n_estimators = idx + 1)
    classifier = classifier.fit(pred_train, tar_train)
    predictions = classifier.predict(pred_test)
    accuracy[idx] = sklearn.metrics.accuracy_score(tar_test, predictions)
plt.cla()
plt.plot(trees,accuracy)
plt.show()
The Output:
Data Analysis:
From the confusion matrix, we see the true negatives and true positives on the diagonal. The 3 and the 7 represent the false negatives and false positives, respectively. The accuracy of the random forest was 84.37%, which means that roughly 84% of the countries were classified correctly as having either moderate or strong breast cancer incidence.

Given that we don't interpret individual trees in a random forest, the most helpful information to be gained from a forest is arguably the measured importance of each explanatory variable. The explanatory variables with the highest relative importance scores were, as expected, life expectancy (0.3487…) and alcohol consumption (0.1982…).

From the graph, it can be seen that growing multiple trees rather than a single tree added considerably (approximately 5 percentage points) to the overall accuracy of the model, suggesting that interpretation of a single decision tree may be inappropriate.
Decision trees
Random forest is a machine learning method. This data mining algorithm is based on decision trees, but proceeds by growing many trees. Whereas a single decision tree is not very reproducible on future data and proceeds by searching for a split on every variable in every node, a random forest searches for a split on only a subset of variables at each node: the variable chosen is the one with the largest association with the target, but only among those explanatory variables that have been randomly selected to be tested for that node.
How does it work?
First, a small subset of explanatory variables is selected at random.
Next, the node is split with the best variable among the small number of randomly selected variables.
Then, a new list of eligible explanatory variables is selected at random to split on the next node. This continues until the tree is fully grown, ideally with one observation in each terminal node.
The eligible variables set will be quite different from node to node.
However, important variables will eventually make it into the tree. Their relative success in predicting the target variable will begin to get them larger and larger numbers of “votes” in their favor.
The growing of each tree in a random forest is not only based on subsets of explanatory variables at each node, but also based on a random subset of the sample for each tree in the forest.
This process of selecting a random sample of observations is known as bagging. Importantly, each tree is grown on a different randomly selected sample of bagged data, with the remaining out-of-bag data available to test the accuracy of that tree. For each tree, the bagging process selects about 60% of the original sample, while the resulting tree is tested against the remaining 40% of the sample. Thus, the bagged and out-of-bag data will be a different 60% and 40% of the observations for each tree. The most important thing to know when interpreting the results of random forests is that the trees generated are not themselves interpreted; instead, they are used to collectively rank the importance of variables in predicting our target of interest.
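As a minimal sketch of these ideas (not the assignment code further down), scikit-learn's RandomForestClassifier exposes both the per-node feature subsampling and the bagging/out-of-bag mechanism described above; here predictors and targets stand for any predictor data frame and binary target, such as the ones defined in the code below:

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,      # grow 100 trees
    max_features="sqrt",   # random subset of explanatory variables tested at each node
    bootstrap=True,        # each tree is grown on a bootstrapped (bagged) sample
    oob_score=True)        # use the out-of-bag observations to estimate accuracy
rf = rf.fit(predictors, targets)
print(rf.oob_score_)            # accuracy estimated from the out-of-bag data
print(rf.feature_importances_)  # relative importance of each explanatory variable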
Running a Classification Tree
Decision trees are predictive models that allow for a data-driven exploration of nonlinear relationships and interactions among many explanatory variables in predicting a response or target variable. When the response variable is categorical (two levels), the model is called a classification tree. Explanatory variables can be quantitative, categorical, or both. Decision trees create segmentations or subgroups in the data by applying a series of simple rules or criteria over and over again, choosing the variable constellations that best predict the response (i.e., target) variable.
In order to perform a classification tree with the Gapminder data set, I had to first bin my response variable (breastcancerper100th) into 2 categories since this data is quantitative.
Here is the complete Python code:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics
from sklearn import tree
from io import StringIO
from IPython.display import Image
import pydot
#to indicate where my data set is located
os.chdir("/home/genara/Documents/Coursera/4_Machine_Learning")
#loading the dataset
AH_data = pd.read_csv("gapminder.csv")

#setting variables you will be working with to numeric because python reads them from the csv file as strings (objects)
def analyse(ax):
    AH_data[ax] = pd.to_numeric(AH_data[ax], errors='coerce')

analyse("breastcancerper100th")
analyse("incomeperperson")
analyse("alcconsumption")
analyse("co2emissions")
analyse("femaleemployrate")
analyse("lifeexpectancy")
analyse("urbanrate")

#grouping breast cancer (response variable) into 2 categories
AH_data["breastcategories"] = pd.cut(AH_data["breastcancerper100th"], bins=[1, 40, 102], labels=False)
labels = np.array("0_Moderate 1_Strong".split())
AH_data["breastcategories"] = pd.to_numeric(AH_data["breastcategories"], errors='coerce')
#Because decision tree analyses cannot handle any NA’s in our data set, my next step is to create a clean data frame that drops all NA’s.
data_clean = AH_data.dropna()
#Setting the new data frame called data_clean, I can now take a look at various characteristics of my data by using the d types and describe functions to examine data types and summary statistics
print(data_clean.dtypes)
print(data_clean.describe())
#I set my explanatory (predictors) and response variable (target).
predictors = data_clean[["incomeperperson", "alcconsumption", "co2emissions", "femaleemployrate", "lifeexpectancy", "urbanrate"]]
targets = data_clean.breastcategories
#And then include the train test split function for predictors and target. And set the size ratio to 60% for the training sample and 40% for the test sample by indicating test_size=.4.
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)
print("The shape of the predictor training sample:")
print(pred_train.shape)
print("The shape of the predictor test sample:")
print(pred_test.shape)
print("The shape of the target training sample:")
print(tar_train.shape)
print("The shape of the target test sample:")
print(tar_test.shape)
#Build model on training data
#Once training and testing data sets have been created, we will initialize the DecisionTreeClassifier from SKLearn.
classifier=DecisionTreeClassifier()
#Then use this classifier.fit function which you pass the training predictors and training targets to. It’s this code that will fit our model.
classifier=classifier.fit(pred_train,tar_train)
#Next we include the predict function where we predict for the test values
predictions=classifier.predict(pred_test)
#and then call in the confusion matrix function which we passed the target test sample to.
sklearn.metrics.confusion_matrix(tar_test,predictions)
#We can also look at the accuracy score
sklearn.metrics.accuracy_score(tar_test, predictions)
#request the picture of our decision tree.
out = StringIO()
tree.export_graphviz(classifier, out_file=out)
graph=pydot.graph_from_dot_data(out.getvalue())
graph.write_pdf("breastcancer.pdf")
Here is the output:
Here is the decision tree:
Analysis:
Let’s first look at the accuracy score, which is approximately 0.89. It suggests that the decision tree model has classified 89% of the sample correctly as either moderate cancer incidence or high cancer incidence.
For our predictor variable, the training sample has 95 rows and 6 explanatory variables, while the test sample has 64 rows and 6 explanatory variables.
The resulting tree starts with the split on X[4], our fifth explanatory variable, life expectancy. If the value is greater than 74.231 years, we move to the right side of the split, which includes 33 of the 95 observations in the training sample. From this node, another split is made on our second explanatory variable, alcconsumption. In the same way, if alcohol consumption is less than 8.3, we move to the left side of the split, which includes 12 of the 33 observations in the training sample. From this node, another split is made on our third explanatory variable, co2emissions, and the splits continue in this way for the other explanatory variables.
It can be seen that among those countries that have life expectancy greater than 74.231 and alcohol consumption levels between 8.6 and 17.8, all of them were classified with strong cancer incidence [20, 0].
Logistic regression
Logistic regression is designed to test a binary response variable.
Considering that I am using the Gapminder data set and all my variables are quantitative, I had to bin my response variable into 2 categories.
Just to remind you…
The explanatory variables are: alcohol consumption, life expectancy and female employment rate.
The response variable is breast cancer per 100th.
Here is my Python code:
import pandas
import numpy
import statsmodels.api as sm
import statsmodels.formula.api as smf
data = pandas.read_csv("gapminder.csv", low_memory=False)

def analyse(ax):
    #setting variables you will be working with to numeric
    #data[ax] = data[ax].convert_objects(convert_numeric=True)
    data[ax] = pandas.to_numeric(data[ax], errors='coerce')

analyse("alcconsumption")
analyse("breastcancerper100th")
analyse("lifeexpectancy")
analyse("femaleemployrate")

#grouping breast cancer (response variable) into 2 categories
data["breastcategories"] = pandas.cut(data["breastcancerper100th"], bins=[-1, 40, 110], labels=False)
labels = numpy.array("0_Moderate 1_Strong".split())
data["breastcancerper100th"] = data["breastcategories"].astype("category")
#Logistic Regression female employment rate
lreg1 = smf.logit(formula="breastcategories ~ femaleemployrate + lifeexpectancy + alcconsumption", data=data).fit()
print(lreg1.summary())
#odds ratios
print("Odds Ratios")
print(numpy.exp(lreg1.params))
#Looking at the confidence intervals
params = lreg1.params
conf = lreg1.conf_int()
conf["OR"] = params
conf.columns = ["Lower CI", "Upper CI", "OR"]
print(numpy.exp(conf))
Output:
Analysis:
Similar to the multiple regression output, we can see the number of observations, the name of my response variable, breastcategories, and a table with the p-values. Notice that among my explanatory variables, only "femaleemployrate" was not significantly associated with breast cancer incidence (p-value = 0.22).

Let's have a look at the odds ratios to better understand these results. The odds ratio compares the odds of an event occurring in one group to the odds of it occurring in another group. An odds ratio can range from zero to positive infinity and is centered around the value one.

The "femaleemployrate" odds ratio is approximately equal to 1 (0.9739…), which means this association is statistically non-significant. In other words, the probability of a high level of breast cancer incidence is about the same for countries with low and high female employment rates.

The odds ratio for "lifeexpectancy" is 1.38. When the odds ratio is greater than 1, the response becomes more likely as the explanatory variable increases. Therefore, the probability of a high level of breast cancer increases among countries with higher life expectancy. We can draw the same conclusion for the "alcconsumption" variable: its odds ratio is 1.22, so countries with high alcohol consumption are 1.22 times more likely to have high breast cancer incidence than countries with low alcohol consumption.

Looking at the confidence intervals, we can get a better picture of how much this value might change for a different sample drawn from the population. For lifeexpectancy, the confidence interval indicates that there is 95% certainty that the true population odds ratio falls between 1.21 and 1.57. Likewise, the alcconsumption confidence interval indicates that there is 95% certainty that the true population odds ratio falls between 1.05 and 1.35.
Multiple Linear Regression
Let's test a multiple regression model by adding the explanatory variables "lifeexpectancy" and "femaleemployrate" to our primary explanatory variable, "alcconsumption".
Here is the Python code:
import pandas
import numpy
import statsmodels.api as sm
import statsmodels.formula.api as smf
import seaborn
import matplotlib.pyplot as plt
data = pandas.read_csv("gapminder.csv")

def analyse(ax):
    #setting variables you will be working with to numeric because python reads them from the csv file as strings (objects)
    data[ax] = pandas.to_numeric(data[ax], errors='coerce')

analyse("alcconsumption")
analyse("breastcancerper100th")
analyse("lifeexpectancy")
analyse("femaleemployrate")

#centering the explanatory variables
data["alcconsumption_c"] = data["alcconsumption"] - data["alcconsumption"].mean()
data["lifeexpectancy_c"] = data["lifeexpectancy"] - data["lifeexpectancy"].mean()
data["femaleemployrate_c"] = data["femaleemployrate"] - data["femaleemployrate"].mean()

print("Alcohol consumption mean after centering: %d" % numpy.nanmean(data["alcconsumption_c"]))
print("Life expectancy mean after centering: %d" % numpy.nanmean(data["lifeexpectancy_c"]))
print("Female employment rate mean after centering: %d" % numpy.nanmean(data["femaleemployrate_c"]))

#multiple regression
reg1 = smf.ols(formula="breastcancerper100th ~ alcconsumption_c + lifeexpectancy_c", data=data).fit()
print(reg1.summary())

reg2 = smf.ols(formula="breastcancerper100th ~ alcconsumption_c + lifeexpectancy_c + femaleemployrate_c", data=data).fit()
print(reg2.summary())
The output:

Analysis:
As you can see from the first OLS regression results, the p-values for both alcconsumption and lifeexpectancy are less than 0.05, and both parameter estimates are positive (5.7 and 9.3, respectively), indicating that alcohol consumption and life expectancy are each positively associated with the incidence of breast cancer after controlling for the part of the association accounted for by the other. In other words, breast cancer incidence is positively associated with the quantity of alcohol consumed after controlling for life expectancy, and life expectancy is positively associated with breast cancer incidence after controlling for alcohol consumption. Also, adding lifeexpectancy does not change the p-value of alcohol consumption, which means lifeexpectancy is not a confounding variable.

When we add our third explanatory variable (femaleemployrate), we see from the second OLS regression results that female employment rate does not have a significant p-value (0.696), which means there is no significant association between female employment rate and breast cancer incidence. Also, the presence of this variable does not change the p-value of our primary explanatory variable (alcohol consumption); therefore, it cannot be considered a confounding variable.
Basic linear regression modelling
This week’s challenge is to test a basic linear regression model for the association between my primary explanatory variable (alcohol consumption) and my response variable (breast cancer incidence).
Python code:
import pandas
import numpy
import statsmodels.api as sm
import statsmodels.formula.api as smf

data = pandas.read_csv("gapminder.csv")

#setting variables you will be working with to numeric because python reads them from the csv file as strings (objects)
def analyse(ax):
    data[ax] = pandas.to_numeric(data[ax], errors='coerce')

analyse("alcconsumption")
analyse("breastcancerper100th")

#setting missing data to NAN
data["alcconsumption"] = data["alcconsumption"].replace('0', numpy.nan)

print("Mean before centering: ")
a = numpy.nanmean(data["alcconsumption"])
print(numpy.float64(a))

#centering the explanatory variable by subtracting its mean
sub1 = numpy.subtract(data["alcconsumption"], a)

print("Mean after centering: %d" % numpy.nanmean(sub1))

#linear regression
print("OLS regression model for the association between alcohol consumption and breast cancer incidence")
reg1 = smf.ols(formula="breastcancerper100th ~ sub1", data=data).fit()
print(reg1.summary())
Output:
Note that I centered my quantitative explanatory variable. Centering involves subtracting the mean of the variable from the actual value of the variable for each observation.
As you can see from the python output, I checked that by calculating the mean after centering, which must be equal to zero.
Interpretation of OLS table:
“Dep. Variable” shows the name of the response variable.
“No. Observations” shows the number of observations with valid data in both response and explanatory variable.
"Prob (F-statistic)" is the p-value for the overall model. In this case, it is really small: 1.09e-11, which means we can reject the null hypothesis and conclude that alcohol consumption is significantly associated with breast cancer incidence.
If we analyze the column “coef”, we will find the equation for the best fit line of this data:
breastcancerper100th = 37.4 + 2.25 * alcconsumption
The "P>|t|" column gives the p-value for the coefficient (equivalent here to the p-value for the Pearson correlation). In this case, it is 0.000.
The OLS model also gives us the R-squared, which is the fraction of the variability of one variable that can be predicted by the other. In this case, it is approximately 24%.
Conclusion: The results of the linear regression model indicated that alcohol consumption ( p=.0001) was significantly and positively associated with breast cancer incidence.
Regression Modeling in Practice - Writing about your data
Describing the sample
Describe the study population (who or what was studied).
Local, national, and global statistics about social, economic, and environmental development were collected from multiple sources (detailed below).
Report the level of analysis studied (individual, group, or aggregate).
The data were retrieved at the aggregate level of the entire country, for each country.
Report the number of observations in the data set.
The data set contains a total of 213 observations.
Describe your data analytic sample (the sample you are using for your analyses).
I chose to subset the full dataset to exclude countries that have a femaleemployrate below 50% and where the rate of breast cancer is above 20.
Describe the procedures that were used to collect the data.
Report the study design that generated that data (for example: data reporting, surveys, observation, experiment).
< The answers to the questions below serve as a good explanation for this question. >
Describe the original purpose of the data collection.
"It seeks to increase the use and understanding of statistics about social, economic and environmental development at local, national and global levels."
Describe how the data were collected. Report where the data were collected.
GapMinder collects data from a handful of sources, including the Institute for Health Metrics and Evaluation, US Census Bureau’s International Database, United Nations Statistics Division, and the World Bank.
The data were collected at the level of the entire country; this was not a survey but a retrieval of existing data.
Report when the data were collected.
The data were collected in 2010, based on the 2010 Gross Domestic Product per capita in constant 2000 US$ (World Bank World Development Indicators).
Describe your variables.
Describe what your explanatory and response variables measured. Describe the response scales for your explanatory and response variables.
Response variables:
breastcancerper100th: 2002 breast cancer new cases per 100,000 females. Number of new cases of breast cancer per 100,000 female residents during the given year.
Explanatory variables:
femaleemployrate: 2007 female employees age 15+ (% of population). Percentage of the female population, age 15 and above, that was employed during the given year. (Main source: International Labour Organisation)

alcconsumption: 2008 alcohol consumption per adult (age 15+), litres. Recorded and estimated average alcohol consumption, adult (15+) per capita, in litres of pure alcohol. (Main source: WHO)
Describe how you managed your explanatory and response variables.
I chose to subset the data to exclude countries that have a femaleemployrate below 50% and where the rate of breast cancer is above 20.
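As a rough sketch of that data-management step (the thresholds are the ones stated above; reading 20 as cases per 100,000 is my assumption):

import pandas as pd

data = pd.read_csv("gapminder.csv")
for col in ["femaleemployrate", "breastcancerper100th"]:
    data[col] = pd.to_numeric(data[col], errors="coerce")

# keep countries with a female employment rate of at least 50
# and breast cancer incidence of at most 20 per 100,000
sub = data[(data["femaleemployrate"] >= 50) & (data["breastcancerper100th"] <= 20)]
print(len(sub))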
Statistical interaction
We have learned that a moderator is a third variable that affects the direction and/or strength of the relationship between the explanatory and response variables.

As I am interested in finding out whether there is a correlation between alcohol consumption and breast cancer, I decided to consider the variable "lifeexpectancy" as my moderator.

All of my variables from the Gapminder data set are quantitative. Therefore, the Pearson correlation test was applied. The results showed a p-value of 2.77e-11, which implies that there is a relationship between the two variables. The Pearson coefficient was 0.48, which indicates a moderate and positive relationship between the variables.
However, does this correlation differ among people with low, middle and high life expectancy?
To answer that question, I grouped the quantitative variable “lifeexpectancy” into 3 categories:
1) low life expectancy : < = 60 years
2) middle life expectancy: > 60 years and < 75 years
3) high life expectancy: >= 75
Then, I performed the Pearson correlation test for each category, followed by its respective scatter plot.
Here is my code in Python3:
import pandas import numpy import seaborn import matplotlib.pyplot as plt # library to calculate the pearson correlation import scipy
data = pandas.read_csv(“gapminder.csv”)
def analyse(ax): #setting variables you will be working with to numeric data[ax] = pandas.to_numeric(data[ax], errors=‘coerce’)
analyse(“alcconsumption”) analyse(“breastcancerper100th”) analyse(“lifeexpectancy”)
# it drops all missing data. We cannot calculate pearson correlation in the presence of NAN values. data_clean = data.dropna()
#Pearson correlation test print(“Association between alcohol consumption and breast cancer incidence (r and p-value)”) print(scipy.stats.pearsonr(data_clean[“alcconsumption”], data_clean[“breastcancerper100th”]))
#set blank data to our third variable to NAN: data[“lifeexpectancy”] = data[“lifeexpectancy”].replace(“ ”,numpy.nan)
#grouping my moderate variable into 3 categories def lifeexpect(row): if row[“lifeexpectancy”] <= 60: return 1 if row[“lifeexpectancy”] < 75: return 2 if row[“lifeexpectancy”] >= 75: return 3
data_clean[“expectgroup”] = data_clean.apply (lambda row: lifeexpect (row),axis=1) chk1 = data_clean[“expectgroup”].value_counts(sort=False, dropna=False)
print(“ ”) print(“Life expectancy levels: ”)
print(chk1)
sub1 = data_clean[(data_clean[“expectgroup”]==1)] sub2 = data_clean[(data_clean[“expectgroup”]==2)] sub3 = data_clean[(data_clean[“expectgroup”]==3)]
#pearson correlation test for each group: def pearson(a, b): print(“Association between alcohol consumption and breast cancer incidence (r and p-value) for ” + b) print(scipy.stats.pearsonr(a[“alcconsumption”], a[“breastcancerper100th”])) print(“ ”)
pearson(sub1, "low life expectancy")
pearson(sub2, "middle life expectancy")
pearson(sub3, "high life expectancy")
#basic scatterplot for quantitative variables (fit_reg=True displays the linear fit)
def graph(a, b):
    scat1 = seaborn.regplot(x="alcconsumption", y="breastcancerper100th", fit_reg=True, data=a)
    plt.xlabel("alcohol consumption")
    plt.ylabel("breast cancer incidences per 100th")
    plt.title("Scatterplot for the association between breast cancer and alcohol consumption for " + b)
    plt.show()
graph(sub1, "low life expectancy")
graph(sub2, "middle life expectancy")
graph(sub3, "high life expectancy")
The output:
Association between alcohol consumption and breast cancer incidence (r and p-value)
(0.47970374089117873, 2.7745539272726357e-11)
Life expectancy levels:
1    38
2    79
3    55

Association between alcohol consumption and breast cancer incidence (r and p-value) for low life expectancy
(0.21020592195462254, 0.2052556664431508)
Association between alcohol consumption and breast cancer incidence (r and p-value) for middle life expectancy
(0.53464945340427583, 3.8663199572728271e-07)
Association between alcohol consumption and breast cancer incidence (r and p-value) for high life expectancy
(0.35845998973152771, 0.0072033613538604168)
Analysis:
For the “low life expectancy” group we found a correlation coefficient of 0.21 and a p-value of 0.205, which is not significant. For the “middle life expectancy” group, r was 0.53 and the p-value was 3.87e-07, indicating a significant, moderately strong positive correlation. For the “high life expectancy” group, r was 0.36 and the p-value was significant at 0.0072, indicating a weaker but still significant positive correlation.
In other words, we can say that “life expectancy” moderates the relationship between alcohol consumption and breast cancer incidence, since the correlation for the low life expectancy group differs from the correlations for the middle and high life expectancy groups.
P.S. Just to remind you: Post hoc tests are not applied for Pearson Correlation Tests.
0 notes
Text
Pearson Correlation
Finally I can apply the right statistical tool for my data.
As you know, I am working with the Gapminder data set. My research question is about the relationship between alcohol consumption and breast cancer. Both variables are quantitative. Therefore, the Pearson correlation test is the right test to answer my research question.
I used the following code in Python3:
import pandas
import numpy
import seaborn
import matplotlib.pyplot as plt
# scipy.stats provides the Pearson correlation function
import scipy.stats
data = pandas.read_csv("gapminder.csv")
def analyse(ax):
    #setting variables you will be working with to numeric
    data[ax] = pandas.to_numeric(data[ax], errors="coerce")
analyse("alcconsumption")
analyse("breastcancerper100th")
#basic scatterplot for quantitative variables (fit_reg=True displays the linear fit)
scat1 = seaborn.regplot(x="alcconsumption", y="breastcancerper100th", fit_reg=True, data=data)
plt.xlabel("alcohol consumption")
plt.ylabel("breast cancer incidences per 100th")
plt.title("Scatterplot for the association between breast cancer and alcohol consumption")
plt.show()
# it drops all missing data. We cannot calculate pearson correlation in the presence of NAN values.
data_clean = data.dropna()
print("Association between alcohol consumption and breast cancer incidence (r and p-value)")
print(scipy.stats.pearsonr(data_clean["alcconsumption"], data_clean["breastcancerper100th"]))
Output:
Association between alcohol consumption and breast cancer incidence (r and p-value)
(0.46939522679894924, 7.2959251804978671e-11)
Analysis:
As we learned, we always have to interpret the scatter plot and the Pearson correlation coefficient (r) together.
In this case, the scatter plot presents a positive relationship between the variables. In other words, an increase in alcohol consumption is associated with an increase in breast cancer incidence. As the scatter plot suggests, the data follow a linear shape, so we can assess the strength of the relationship by interpreting the Pearson correlation coefficient (r).
The Pearson correlation coefficient is 0.469, which indicates a moderate, positive linear relationship between the variables.
The p-value here is 7.2959251804978671e-11 (<0.05), which means we have enough evidence to reject the null hypothesis and accept the alternate hypothesis that says there is a relationship between alcohol consumption and breast cancer.
Now, let’s calculate the fraction of the variability of one variable that can be predicted by the other using r2.
r2 = 0.469^2 ≈ 0.22 (22%)
This means that if we know a country's alcohol consumption, we can account for about 22% of the variability we see in its breast cancer incidence.
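As a quick check, here is a minimal sketch that computes r2 directly from the coefficient reported above (nothing beyond the arithmetic):

# r reported by scipy.stats.pearsonr above
r = 0.46939522679894924
r_squared = r ** 2
print(round(r_squared, 2))  # 0.22, i.e. about 22% of the variability explained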
0 notes
Text
Chi-Square Test
Now it’s time to practice the Chi-Square Test of Independence. This test is used when both response and explanatory variables are categorical.
As I chose to work with the Gapminder data set, I had to categorize my quantitative variables. They were categorized as follows:
Explanatory variable: alcohol consumption
0_Rarely, 1_Sometimes, 2_Often, 3_Very_Often, 4_Almost_Daily, 5_Daily
Response Variable: breast cancer per 100th
0_Moderate, 1_Strong
As a reminder, here is my research question:
Is alcohol consumption associated with breast cancer in women?
Hypothesis:
H0: There is no relationship between the two categorical variables. Breast cancer incidence is independent of alcohol consumption.
H1: There is a relationship between the variables. Breast cancer incidence is associated with alcohol consumption.
I ran the Chi-Square Test and post hoc comparisons using Python with the following code:
import pandas
import numpy
import matplotlib.pyplot as plt
import seaborn
#to calculate Chi Square Test we need to import the following library:
import scipy.stats
data = pandas.read_csv("gapminder.csv", low_memory=False)
def analyse(ax):
    #setting variables you will be working with to numeric
    #data[ax] = data[ax].convert_objects(convert_numeric=True)
    data[ax] = pandas.to_numeric(data[ax], errors="coerce")

analyse("alcconsumption")
analyse("breastcancerper100th")
#grouping alcohol consumption (explanatory variable) into categories
data["categories"] = pandas.cut(data["alcconsumption"], bins=[-1, 2, 5, 7, 10, 12, 25], labels=False)
labels = numpy.array("0_Rarely 1_Sometimes 2_Often 3_Very_Often 4_Almost_Daily 5_Daily".split())
data["categories"] = labels[data["categories"]]
data["alcconsumption"] = data["categories"].astype("category")
#grouping breast cancer (response variable) into categories
data["breastcategories"] = pandas.cut(data["breastcancerper100th"], bins=[-1, 40, 110], labels=False)
labels = numpy.array("0_Moderate 1_Strong".split())
data["breastcancerper100th"] = data["breastcategories"].astype("category")
#setting missing data:
data["categories"] = data["categories"].replace(9, numpy.nan)
data["breastcancerper100th"] = data["breastcancerper100th"].replace(99, numpy.nan)
#tell python to not break lines in the table.
pandas.set_option("display.expand_frame_repr", False)
#contingency table of observed counts
ct1 = pandas.crosstab(data["breastcategories"], data["categories"])
print(ct1)
#column percentages
colsum = ct1.sum(axis=0)
colpct = ct1 / colsum
print(colpct)
#chi-square
print("chi-square value, p-value and expected counts")
cs1 = scipy.stats.chi2_contingency(ct1)
print(cs1)
#set explanatory variable to category
data["categories"] = data["categories"].astype("category")
data["breastcategories"] = pandas.to_numeric(data["breastcategories"], errors="coerce")
#generate graph
seaborn.factorplot(x="categories", y="breastcategories", data=data, kind="bar", ci=None)
plt.xlabel("Alcohol consumption")
plt.ylabel("Breast cancer")
plt.show()
#post hoc test
def posthoc(x, y):
    # keep only the two categories being compared; all other categories map to NaN
    recorde2 = {x: x, y: y}
    data["new"] = data["categories"].map(recorde2)
    z = pandas.crosstab(data["breastcategories"], data["new"])
    print(z)

    colsum = z.sum(axis=0)
    colpct = z / colsum
    print(colpct)

    print("chi-square value, p-value and expected counts")
    w = scipy.stats.chi2_contingency(z)
    print(w)
posthoc("0_Rarely", "1_Sometimes")
posthoc("0_Rarely", "2_Often")
posthoc("0_Rarely", "3_Very_Often")
posthoc("0_Rarely", "4_Almost_Daily")
posthoc("0_Rarely", "5_Daily")
posthoc("1_Sometimes", "2_Often")
posthoc("1_Sometimes", "3_Very_Often")
posthoc("1_Sometimes", "4_Almost_Daily")
posthoc("1_Sometimes", "5_Daily")
posthoc("2_Often", "3_Very_Often")
posthoc("2_Often", "4_Almost_Daily")
posthoc("2_Often", "5_Daily")
posthoc("3_Very_Often", "4_Almost_Daily")
posthoc("3_Very_Often", "5_Daily")
posthoc("4_Almost_Daily", "5_Daily")
Comments about the code:
As you can see, I used a function to perform my post hoc comparisons, so I didn't have to repeat the same code for all 15 pairwise comparisons. As a result, my code is shorter and easier to understand.
The output:
Analysis:
To make this analysis we rely on the p-value. The p-value for the chi-square test is the probability of getting counts like those observed, assuming that the two variables are not related. As we can see from the output (first figure), the p-value is 7.8140560474987615e-09, which is less than 0.05. Therefore, we can reject the null hypothesis and perform a post hoc test to determine which groups differ from the others.
First, we need to calculate the Bonferroni-adjusted significance threshold:
adjusted alpha = 0.05 / (number of comparisons)
adjusted alpha = 0.05 / 15 ≈ 0.0033
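A minimal sketch of that adjustment in Python (nothing here beyond the arithmetic above):

# Bonferroni adjustment: divide the overall alpha by the number of
# pairwise post hoc comparisons (15 pairs for 6 categories)
alpha = 0.05
n_comparisons = 15
adjusted_alpha = alpha / n_comparisons
print(round(adjusted_alpha, 4))  # 0.0033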
As we can see from the output (second and third figures), the following pairwise comparisons show a significant difference (p-value < 0.0033):
0_Rarely 3_Very_Often (p-value = 0.0023697660078802517)
0_Rarely 4_Almost_Daily (p-value = 8.7439723394398583e-05)
0_Rarely 5_Daily (p-value = 4.6297312509504069e-06)
1_Sometimes 3_Very_Often (p-value = 0.002387548492290665)
1_Sometimes 4_Almost_Daily (p-value = 8.8888361096168557e-05)
1_Sometimes 5_Daily (p-value = 7.6901414335224755e-06)
2_Often 5_Daily (p-value = 0.0012924321827553879)
0 notes
Text
Running ANOVA
Recall question:
“Is alcohol consumption associated with breast cancer?”
Define ANOVA:
As we learned, Analysis of Variance is used when we have a quantitative response variable and a categorical explanatory variable. The objective is to examine differences in the mean of the response variable across the categories of the explanatory variable.
The dataset used to answer my research question, unfortunately, did not contain any categorical variables. Therefore, I had to create my own in order to run an ANOVA test on the data (keep reading for more!).
Creating categorical variables out of nowhere:
My categories for alcohol consumption were defined as follows:
0_Rarely
1_Sometimes
2_Often
3_Very_Often
4_Almost_Daily
5_Daily
As you can see, we are working with more than two levels of a categorical, explanatory variable.
H0: all group means are equal, i.e. there is no relationship between breast cancer and alcohol consumption
Ha: not all group means are equal, i.e. there is a relationship between breast cancer and alcohol consumption
My code for running ANOVA:
# Importing necessary libraries
import pandas
import numpy
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
# Data gathering and formatting
data = pandas.read_csv("gapminder.csv", low_memory=False)
def analyse(ax):
    #setting variables you will be working with to numeric
    #(pandas.to_numeric replaces the deprecated convert_objects call)
    data[ax] = pandas.to_numeric(data[ax], errors="coerce")
analyse("alcconsumption")
analyse("breastcancerper100th")
analyse("femaleemployrate")
#grouping alcohol consumption into categories
data["categories"] = pandas.cut(data["alcconsumption"], bins=[-1, 2, 5, 7, 10, 12, 25], labels=False)
# initialise categorical variables
labels = numpy.array("0_Rarely 1_Sometimes 2_Often 3_Very_Often 4_Almost_Daily 5_Daily".split())
data["categories"] = labels[data["categories"]]
data["alcconsumption"] = data["categories"].astype("category")
#setting missing data:
data["categories"] = data["categories"].replace(9, numpy.nan)
data["breastcancerper100th"] = data["breastcancerper100th"].replace(99, numpy.nan)
#using the ols function to calculate the F-statistic and associated p-value:
model1 = smf.ols(formula="breastcancerper100th ~ C(categories)", data=data)
results1 = model1.fit()
print(results1.summary())
sub2 = data[["breastcancerper100th", "categories"]].dropna()
print("means for alcohol consumption")
m1 = sub2.groupby("categories").mean()
print (m1)
print("standard deviations for alcohol consumption")
sd1 = sub2.groupby("categories").std()
print (sd1)
My output:
Analysis:
In this case the p-value is really small: 4.68e-13 (that's what she said!). Therefore, the data provide enough evidence to reject the null hypothesis. In other words, breast cancer incidence is related to alcohol consumption.
My code to the Post Hoc Test:
#post hoc Tukey's Test:
mc1 = multi.MultiComparison(sub2["breastcancerper100th"], sub2["categories"])
res1 = mc1.tukeyhsd()
print(res1.summary())
The output:
Multiple Comparison of Means - Tukey HSD, FWER=0.05
===============================================================
    group1          group2      meandiff   lower    upper  reject
---------------------------------------------------------------
0_Rarely       1_Sometimes      -1.9146  -14.25    10.4207  False
0_Rarely       2_Often           2.2193  -11.6011  16.0397  False
0_Rarely       3_Very_Often     21.4342    8.6633  34.2052  True
0_Rarely       4_Almost_Daily   33.8168   14.7372  52.8964  True
0_Rarely       5_Daily          27.1051   14.0784  40.1318  True
1_Sometimes    2_Often           4.1339  -10.3     18.5679  False
1_Sometimes    3_Very_Often     23.3488    9.9164  36.7813  True
1_Sometimes    4_Almost_Daily   35.7314   16.2028  55.26    True
1_Sometimes    5_Daily          29.0197   15.3438  42.6956  True
2_Often        3_Very_Often     19.2149    4.407   34.0228  True
2_Often        4_Almost_Daily   31.5975   11.0985  52.0965  True
2_Often        5_Daily          24.8858    9.8567  39.9148  True
3_Very_Often   4_Almost_Daily   12.3826   -7.424   32.1892  False
3_Very_Often   5_Daily           5.6709   -8.3991  19.7409  False
4_Almost_Daily 5_Daily          -6.7117  -26.6842  13.2607  False
Analysis:
The differences among the group means can be seen in the reject column, where True means the null hypothesis of equal means is rejected for that pair of groups.
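If you prefer to read those decisions programmatically rather than from the printed table, the result object returned by tukeyhsd() exposes them as attributes (a minimal sketch, assuming the mc1 and res1 objects from the code above; reject and meandiffs are attributes of statsmodels' TukeyHSDResults):

# res1 is the TukeyHSDResults object produced by mc1.tukeyhsd() above
# reject: boolean array, one entry per pairwise comparison, in the same
# order as the rows of the summary table
print(res1.reject)
# meandiffs: the corresponding differences in group means
print(res1.meandiffs)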
0 notes
Text
Making graphs for your data
STEP 1: Create graphs of your variables one at a time (univariate graphs). Examine both their center and spread.
Univariate analysis
Here is my code in Python:
#import the libraries
import pandas
import numpy
import seaborn
import matplotlib.pyplot as plt
#Read the data
data = pandas.read_csv("gapminder.csv")
def analyse(ax):
    #setting variables you will be working with to numeric
    #(pandas.to_numeric replaces the deprecated convert_objects call)
    data[ax] = pandas.to_numeric(data[ax], errors="coerce")

    # standard deviation and other descriptive statistics for quantitative variables
    desc1 = data[ax].describe()
    print(desc1)

    print("median")
    median1 = data[ax].median()
    print(median1)
    print("mode")
    mode1 = data[ax].mode()
    print(mode1)
    #Univariate histogram for quantitative variable:
    seaborn.distplot(data[ax].dropna(), kde=False)
    plt.xlabel(ax)
    plt.title("The graph of " + ax)
    plt.show()
analyse("alcconsumption")
analyse("breastcancerper100th")
Output:
Descriptive statistics for alcconsumption:
count    213.000000
mean       5.872864
std        5.087257
min        0.000000
25%        1.030000
50%        5.000000
75%        9.500000
max       23.010000
median
5.0
mode
0    0
Descriptive statistics for breastcancerper100th:
count    173.000000
mean      37.402890
std       22.697901
min        3.900000
25%       20.600000
50%       30.000000
75%       50.300000
max      101.100000
median
30.0
mode
0    28.1
Summary:
The first graph shows that the higher the alcohol consumption, the fewer the countries that report that level of consumption.
The second graph shows a similar right-skewed pattern, although it is less clear because there are spikes in the distribution.
STEP 2: Create a graph showing the association between your explanatory and response variables (bivariate graph). Your output should be interpretable (i.e. organized and labeled).
Bivariate analysis
Here is my code:
import pandas
import numpy
import seaborn
import matplotlib.pyplot as plt
data = pandas.read_csv("gapminder.csv")
def analyse(ax):
    #setting variables you will be working with to numeric
    #(pandas.to_numeric replaces the deprecated convert_objects call)
    data[ax] = pandas.to_numeric(data[ax], errors="coerce")
analyse("alcconsumption")
analyse("breastcancerper100th")
#basic scatterplot for quantitative variables (fit_reg=True displays the linear fit)
scat1 = seaborn.regplot(x="alcconsumption", y="breastcancerper100th", fit_reg=True, data=data)
plt.xlabel("alcohol consumption")
plt.ylabel("breast cancer incidences per 100th")
plt.title("Scatterplot for the association between breast cancer and alcohol consumption")
plt.show()
#quartile split
print("Alcohol consumption - 4 categories - quartiles")
data["alccategories"] = pandas.qcut(data.alcconsumption, 4, labels=["1=25%tile", "2=50%tile", "3=75%tile", "4=100%tile"])
c10 = data["alccategories"].value_counts(sort=False, dropna=True)
print(c10)
#bivariate bar graph for categorical and quantitative variables
seaborn.factorplot(x="alccategories", y="breastcancerper100th", data=data, kind="bar", ci=None)
plt.xlabel("alcohol consumption")
plt.ylabel("breast cancer incidences per 100th")
plt.show()
Here is the output:
Alcohol consumption - 4 categories - quartiles
1=25%tile     54
2=50%tile     53
3=75%tile     53
4=100%tile    53
Summary:
The bar graph shows that mean breast cancer incidence rises across the alcohol consumption quartiles, suggesting that alcohol consumption and breast cancer are positively associated.
0 notes