pustolovka-blog - Tumblr blog

pustolovka-blog · 4 years


K-means clustering using add_health dataset

Following the last week of the Machine Learning for Data Analysis course, I conducted a k-means clustering analysis using the add_health dataset. I focused on the examples from the course, looking at different variables that could have an impact on students’ GPA. Clustering variables included violent behavior, alcohol consumption, marijuana consumption, school connectedness, family connectedness, parental presence, depression, self-esteem, parental activity and alcohol problems. All variables were standardized to have a mean of 0 and a standard deviation of 1. The code is presented below.
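The standardization step described above can be sketched in a few lines. This is only an illustration with made-up numbers, not the code from the assignment; it uses the population standard deviation, which is also what scikit-learn's preprocessing.scale uses.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    m = mean(values)
    s = pstdev(values)  # population standard deviation
    return [(v - m) / s for v in values]

# Hypothetical scores for one clustering variable
scores = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = standardize(scores)
print(z)
```

After this step every clustering variable is on the same scale, so no single variable dominates the Euclidean distances used by k-means.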

Data was further split into a training and a test dataset, using 70% for training and 30% for testing. From there, using Euclidean distance, I conducted a series of k-means cluster analyses specifying k=1-9 clusters. The variance in the clustering variables that was accounted for by the clusters (r-square) was plotted for each of the nine cluster solutions in an elbow curve to provide guidance for choosing the number of clusters to interpret. The code is below:
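A rough sketch of the elbow procedure described above, run on synthetic data rather than add_health. The three artificial groups, the variable names and the use of scikit-learn's inertia to derive r-square are my assumptions for illustration; the original assignment code is not reproduced here.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three well separated groups in two dimensions (assumption)
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize each variable

# Fit k-means for k = 1..9 and record the within-cluster sum of squares
inertias = []
for k in range(1, 10):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

# With one cluster the inertia equals the total sum of squares,
# so r-square = 1 - within / total for every k
total_ss = inertias[0]
r_square = [1 - w / total_ss for w in inertias]

for k, r2 in zip(range(1, 10), r_square):
    print(k, round(r2, 3))
```

Plotting r_square against k gives the elbow curve; the point where the curve flattens is the candidate number of clusters.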

As a result, I got a plot visualizing the potential number of clusters. The output is below:

The elbow curve was inconclusive, suggesting that 2, 3 or 9 clusters could be interpreted. I opted for interpreting 3, using the following code:

As a result, I got a scatterplot of the two canonical variables, shown below:

Cluster three (turquoise) was quite distinct, having little overlap with the other two; however, there was significant variance within the cluster. Clusters one (yellow) and two (purple) showed little variance within the clusters, but there was some overlap between them. Based on the results, it is possible that two clusters would suffice for interpreting this data.

I finished my analysis by merging variables and interpreting the output using the following code:

In conclusion, I could see that clusters 0 and 1 (in the table below) showed the greatest difference. Cluster 1 contained adolescents who scored high on alcohol and marijuana consumption, violence, depression and deviant behaviour, and who had low self-esteem, low parental activity and low family connectedness. Cluster 1 represents adolescents encountering trouble and difficulties in their lives. On the other hand, cluster 0 contained adolescents who scored low on alcohol and marijuana consumption, violence, deviant behavior and similar, while scoring high on self-esteem, parental activity, family connectedness etc. These were adolescents who were less troubled or not troubled in their lives. Cluster 2 included those who had average results on most of the variables.

In order to validate the results, I conducted an Analysis of Variance (ANOVA) to test for significant differences between the clusters on grade point average (GPA). A Tukey test was used for post hoc comparisons between the clusters. The code is below:

As a result, I could see that there were significant differences between the clusters on GPA (picture below). The Tukey post hoc comparisons showed significant differences between clusters on GPA, with the exception that clusters 0 and 2 were not significantly different from each other. Adolescents in cluster 2 had the highest GPA (mean=2.99, sd=0.73), and cluster 1 had the lowest GPA (mean=2.43, sd=0.79).

pustolovka-blog · 4 years

Week 3: Lasso Regression

Following week 3 of the course, on Lasso Regression, I completed the assignment using the add_health dataset. In contrast to the analysis in the lectures, I chose depression (DEP1) as my target/response variable rather than school connectedness (SCHCONN1). The analysis conducted in this assignment shows which other variables are most strongly connected to depression.

I wrote the following code:

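As a result, I first got a list of regression coefficients for the predictors of my target variable DEP1. The predictors with the largest coefficients are the ones most strongly connected to depression, while those whose coefficients were shrunk to 0 were effectively removed from the model by lasso regression.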

Strongly connected:

'ESTEEM1': -1.7652286145736722,
'SCHCONN1': -1.085865883493359,
'FAMCONCT': -0.76871421576888954

Weakly connected or removed (coefficient of 0):

'ALCEVR1': 0.1536519510237751,
'COCEVER1': 0.135117961131035,
'GPA1': -0.090492604677195485,
'NAMERICAN': 0.095304431061106198,
'CIGAVAIL': 0.064324369645845383,
'PARPRES': -0.060793321061830156,
'ASIAN': 0.047412176135257091,
'MAREVER1': 0.0050799568064496927,
'HISPANIC': 0.0,
'BLACK': 0.0,
'EXPEL1': 0.0,
'INHEVER1': 0.0

My code for visualizing this shows the mentioned factors in the plot below. Self-esteem, school connectedness and family connectedness are all negatively associated with depression.

The plot of mean squared error on each fold shows that the MSE levels off and becomes flat (stable) at 3, indicating that only 3 folds are required.

Also looking at MSE, my analysis showed similar results for both training (29.56) and test datasets (30.94), as well as r-squared values (training 0.296, and test 0.324). The R-squares of .3 indicate moderate model fit for this LASSO regression.

pustolovka-blog · 4 years

Machine Learning for Data Analysis - Random Forests

Progressing with the Machine Learning for Data Analysis course, this week I ran a random forest analysis on the add_health dataset. Following the course instructions, after loading the dataset, I selected all the variables, such as sex, race, ethnicity, alcohol consumption, marijuana consumption, GPA, relationship with parents etc. (see the full list in the code attached).

The code I wrote was the following:

Similar to the results shown in the lectures, my output showed that the variable most important for predicting regular smoking was previous consumption of marijuana (0.132), followed by deviant behaviour (0.0772) and GPA (0.0720). On the other hand, the least important variables were those for Native American and Asian ethnicity.

When checking accuracy, my analysis showed that the initial tree had an accuracy slightly higher than 82%, while after running random forests the accuracy increased only slightly, to somewhat below 84%. This suggests that a single tree would have been nearly as accurate.

pustolovka-blog · 5 years

Machine Learning for Data Analysis - Decision Trees

Following the Data Analysis and Interpretation Specialization, I started the course focusing on machine learning. While I had been working with the Gapminder dataset in the previous courses of this specialization, for this assignment I followed the suggested add health data set. Given that I was not as acquainted with the data, I primarily focused on the examples given in class for delivering this assignment.

The code I used was the following:

from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics

data = pd.read_csv("/Users/noraborealis/anaconda3/tree_addhealth.csv")

""" Data Engineering and Analysis """

#Load the dataset
AH_data = pd.read_csv("tree_addhealth.csv")
data_clean = AH_data.dropna()

data_clean.dtypes
data_clean.describe()

""" Modeling and Prediction """

#Split into training and testing sets
predictors = data_clean[['BIO_SEX','HISPANIC','WHITE','BLACK','NAMERICAN','ASIAN',
'age','ALCEVR1','ALCPROBS1','marever1','cocever1','inhever1','cigavail','DEP1',
'ESTEEM1','VIOL1','PASSIST','DEVIANT1','SCHCONN1','GPA1','EXPEL1','FAMCONCT','PARACTV',
'PARPRES']]

targets = data_clean.TREG1

pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)

pred_train.shape
pred_test.shape
tar_train.shape
tar_test.shape

#Build model on training data
classifier=DecisionTreeClassifier()
classifier=classifier.fit(pred_train,tar_train)

predictions=classifier.predict(pred_test)

sklearn.metrics.confusion_matrix(tar_test,predictions)
sklearn.metrics.accuracy_score(tar_test, predictions)

#Displaying the decision tree
from sklearn import tree
from io import StringIO
from IPython.display import Image
out = StringIO()
tree.export_graphviz(classifier, out_file=out)
import pydotplus
graph=pydotplus.graph_from_dot_data(out.getvalue())
Image(graph.create_png())

After many technical challenges with running Graphviz both on Mac and on Windows, I finally got the desired result - graphic visualization of my decision tree. The goal of the decision tree was to test nonlinear relationships between a binary categorical variable and many explanatory variables. The code written tested all possible separations and cut points, and the result was the following:

In order to account for a diverse array of factors contributing to smoking experimentation, variables such as age, race, gender, use of marijuana, alcohol, inhalants, violence, self-esteem, parental socio-economic status (receiving social aid), and several others (see the code) were used. Because the tree was grown without pruning or a depth limit, it is very complex, containing many leaves, and difficult to understand, and thus not very useful for further understanding the data.
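One way to get a smaller, more readable tree is to limit its growth. The sketch below uses synthetic data and assumes a scikit-learn version recent enough to offer the ccp_alpha parameter (cost-complexity pruning) on DecisionTreeClassifier; the parameter values are arbitrary illustrations, not tuned settings and not part of my original assignment.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary target driven by one noisy predictor (assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = ((X[:, 0] + 0.5 * rng.normal(size=300)) > 0).astype(int)

# Fully grown tree, as in the assignment code
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Same model with a depth limit and cost-complexity pruning
small_tree = DecisionTreeClassifier(
    max_depth=3, ccp_alpha=0.01, random_state=0
).fit(X, y)

n_full = full_tree.get_n_leaves()
n_small = small_tree.get_n_leaves()
print("leaves without limits:", n_full)
print("leaves with limits:", n_small)
```

A tree with a handful of leaves can be exported to Graphviz and actually read, at the cost of some accuracy on the training data.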

pustolovka-blog · 5 years

Gapminder Dataset - Testing Logistic Regression Model

When I first started using the GapMinder dataset, I was curious to understand whether countries with greater income per person also use more electricity. After doing multiple regression analysis, I understood that there were potential confounding factors such as urban rate and polity score. I decided to test these using a logistic regression model.

Since my response variable was quantitative, I first had to change it into a categorical one, where 0 indicated low electricity consumption and 1 indicated high electricity consumption. The variable was binned into two categories using the median value in order to fulfill the assignment; however, I do not feel that these categories show the complexity of the energy consumption distribution.
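The median split described above can be sketched like this. The consumption values are invented for illustration, and treating values strictly above the median as "high" is my assumption about where the cut falls.

```python
from statistics import median

# Hypothetical electricity consumption per person (kWh)
consumption = [120.0, 480.0, 950.0, 2300.0, 7600.0]

cutoff = median(consumption)

# 1 = high consumption (above the median), 0 = low consumption
high = [1 if value > cutoff else 0 for value in consumption]
print(cutoff, high)
```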

Then I moved on to conducting logistic regression analysis. Here is my code:

My output

Two out of three of my explanatory variables were found to have a significant positive association with electric consumption:

urban rate (Beta=0.049, P=0.022)

income per person (Beta=0.0006, P<0.001)

Polity score did not show a significant association with electric consumption (Beta=-0.039, P=0.418).

When it comes to odds ratios, my output again confirms the previous results. The odds ratio for polity score is close to 1 (0.96), in line with the finding that polity score does not have a significant influence on electric consumption. The odds ratio for income per person is also very close to 1 (1.0006), but it describes the change in odds for a one-unit increase in income, so the effect is small per unit while still being statistically significant. The odds ratio for urban rate is slightly above 1 (1.051), indicating that countries with a greater urbanization rate are more likely to have high electric consumption.
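An odds ratio is simply the exponential of the logistic regression coefficient. The sketch below recomputes the odds ratios from the rounded betas reported above (so the last digit can differ slightly from my output) and shows how the income odds ratio changes when it is expressed per 1000 units of income instead of per single unit. The dictionary keys are illustrative names.

```python
from math import exp

# Rounded coefficients from the logistic regression output above
betas = {"urbanrate": 0.049, "incomeperperson": 0.0006, "polityscore": -0.039}

# Odds ratio for a one-unit increase in each explanatory variable
odds_ratios = {name: exp(beta) for name, beta in betas.items()}

# The same income effect expressed per 1000 units of income
or_income_per_1000 = exp(betas["incomeperperson"] * 1000)

for name, value in odds_ratios.items():
    print(name, round(value, 4))
print("income per 1000 units:", round(or_income_per_1000, 4))
```

This is why an odds ratio close to 1 does not by itself mean a weak effect: it depends on the unit in which the variable is measured.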

Given the low scores it is difficult to draw any specific conclusions from this type of model. As indicated above, given the type of quantitative data used, I do not find the logistic regression model as most fitting for drawing conclusions.

pustolovka-blog · 5 years

Working on the Gapminder dataset, for the third week of the course I conducted polynomial regression analysis using the rate of electricity consumption and income.

My conclusion after having conducted multiple regression models is that the hypothesis that income and energy consumption are correlated is correct. In addition, I have learned that the correlation is not linear, and that while low income countries have low electricity consumption, energy consumption grows with income, with the exception of some low income countries that have very high energy consumption. This can be attributed to emerging economies which still have low income per capita, but an increase in population and demand for electricity.

In the beginning, I created both first and second order polynomials and displayed them on a scatterplot in order to see whether a linear or a curved model better fits the data.

The scatterplot showed me that a curved line catches the nonlinear nature of the association better.

In order to adapt my model to the results, I first centered the variables I was testing and then ran simple, quadratic and cubic regression analysis to see how the model changes.
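The comparison between a straight line and a curve can be sketched with NumPy alone. The data below are synthetic (a curved relationship plus noise), and the helper r_squared is a hypothetical function written for this illustration; the assignment itself used regression output from a statistics package.

```python
import numpy as np

# Synthetic curved relationship with noise (assumption, not Gapminder data)
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 60)
y = 2 + 3 * x - 0.25 * x**2 + rng.normal(scale=0.5, size=x.size)

# Center the explanatory variable before adding polynomial terms
x_c = x - x.mean()

def r_squared(x, y, degree):
    """Fit a polynomial of the given degree and return its R-squared."""
    coefs = np.polyfit(x, y, degree)
    fitted = np.polyval(coefs, x)
    ss_res = np.sum((y - fitted) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

r2_linear = r_squared(x_c, y, 1)
r2_quadratic = r_squared(x_c, y, 2)
print(round(r2_linear, 3), round(r2_quadratic, 3))
```

When the association is curved, the second-order fit explains clearly more of the variability than the first-order fit.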

My output was the following:

Simple regression analysis

Looking at the results, I could see that the p value was less than 0.05 and that the parameter estimate of 4.7 showed a positive association between income and electric consumption. Furthermore, R-squared indicated that the model was capturing 42% of the variability.

Quadratic regression analysis

After introducing the quadratic term of the electricity consumption variable, the model improved. Both p values were lower than 0.05 and the parameter estimate showed that the curve began at a lower point, went up and then went down again, just as the scatterplot showed. The R-squared value also increased, indicating that the model was capturing 63% of the variability.

The warning displayed in the model is expected given that variable electric consumption squared is of course correlated with the variable electric consumption. Both variables are kept in the model in order to account for the curved line.

Cubic regression analysis

In order to check for further nonlinear aspects of the model, I also ran a cubic regression analysis. While the p value and the parameter estimate showed significant correlation, the R-squared value decreased compared to the model with quadratic regression.

In order to test the multiple regression model, I first added another variable - urban rate and then conducted a qqplot test.

After inserting a new variable, the regression analysis output I received was the following:

I could see that the p values of electric consumption and electric consumption squared stayed significant even after adding a new variable. Furthermore, I could see that the intercept value for income variable was significant, and that the R-squared value was high.

The results of the q-q plot test were the following:

The q-q test showed that most residuals followed a straight line, with the exception of some at the very top and bottom of the line. This shows that other factors could be attributed to the variability, not just income and urban rate.

After the q-q plot test, I tested standardized residuals:

The output I got was the following:

Based on the results, I could see that 95% of the countries fall within two standard deviations of the mean. However, one country appears to be an extreme outlier, falling beyond 3 standard deviations.

Finally, I conducted the leverage plot test:

The output I got was the following:

The leverage plot shows that there are several outliers that fall outside 2 standard deviations. However, the plot also shows that their leverage is almost insignificant. One observation that does have high leverage (201) is not an outlier, which makes the model quite sound.

pustolovka-blog · 5 years

Linear Regression Analysis using GapMinder dataset

Since the first course, my interest was in understanding the association between income level (explanatory variable) and energy consumption (response variable) in different countries. GapMinder dataset uses solely quantitative data which means that in this assignment I did the following:

1) ran code to find the mean for my explanatory variable “incomeperperson”

The mean for the explanatory variable before centering was 8740.97:

2) centred the mean to 0 (or close to 0)

3) ran code to test that the mean was centered to 0 and got the following output:

4) ran linear regression analysis for the variables of income and electricity consumption:

5) my output was the following:

The conclusion of this linear regression analysis is that there is a significant association, given that the F-statistic is 94.47 and the associated p-value is very low (4.63e-17). The intercept is 3391.2, and the slope coefficient is 4.7. The analysis was done on 130 observations and the dependent variable is income level. This confirms that there is a positive association between income level and electric consumption.
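The centering step and its effect on the intercept can be sketched by hand. The numbers below are invented and the variable names are illustrative; the point is only that after centering the explanatory variable, its mean is 0 and the regression intercept equals the mean of the response variable.

```python
from statistics import mean

# Hypothetical values, not the Gapminder data
income = [500.0, 1500.0, 4000.0, 12000.0, 30000.0]
electricity = [100.0, 300.0, 900.0, 2500.0, 6000.0]

# Center the explanatory variable by subtracting its mean
income_mean = mean(income)
income_c = [x - income_mean for x in income]

# Ordinary least squares slope and intercept computed by hand
y_mean = mean(electricity)
sxy = sum(xc * (y - y_mean) for xc, y in zip(income_c, electricity))
sxx = sum(xc ** 2 for xc in income_c)
slope = sxy / sxx
intercept = y_mean - slope * mean(income_c)

print("mean after centering:", mean(income_c))
print("slope:", slope, "intercept:", intercept)
```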

pustolovka-blog · 5 years

Understanding the association between income level and electricity consumption

Sample

The sample used for understanding the association between income level and electricity consumption comes from the GapMinder dataset. Promoting sustainable global development, GapMinder collects information on various factors related to the living conditions of a society in all 192 UN member states, and additional 24 areas. Data generated relates to factors such as HIV rate, gross domestic product, unemployment rates etc.

The population studied are individual countries (192) plus the additional 24 areas (such as West Bank and Gaza). Given the complexity of the variables (15 in total), and the diverse sources of information, the dataset does not contain information on each of the factors for each of the countries studied.

Procedure

The original purpose of data collection was the promotion of sustainable development. Data was collected using data reporting to the different reliable agencies and sources - United Nations Statistics Division, World Bank, Institute for Health Metrics and Evaluation and US Census Bureau’s International Database. Data was collected between 2002 and 2011. The collection of data was done by the specific agencies - alcohol consumption: WHO, female employment rate: ILO, GDP per capita: World Bank, etc.

Measures/Variables

In order to understand whether there was an association between income level and electricity consumption, I looked at GDP per capita in constant 2000 US$ and residential electricity consumption per person in kWh. In order to better understand the data, I excluded those countries which did not have information on one or the other factor, and grouped the income variable into poverty, low, middle and high income countries. I also managed the electricity consumption variable by creating categories of very low, low, middle and high electricity consumption. I tested the variables both as categorical and numerical using different analysis tools, tested for confounding variables, and always got a positive association between income and electricity consumption - higher income countries have greater electricity consumption.

pustolovka-blog · 5 years

Testing a moderator using the correlation coefficient in the Gapminder dataset

In order to understand whether the correlation between income level and electric consumption in different countries is related to an external third factor, I used the Pearson correlation coefficient and included the variable showing the level of democracy (polityscore). This variable is expressed on a scale from -10 to 10, so I divided countries into two categories: low democracy and high democracy. From there I conducted my analysis.

#testing for moderation

def polity(row):
    if row["polityscore"] <= 0:
        return 1
    elif row["polityscore"] <= 10:
        return 2

mydata_clean["polity"] = mydata_clean.apply(lambda row: polity(row), axis=1)
chk1 = mydata_clean["polity"].value_counts(sort=False, dropna=False)

low = mydata_clean[(mydata_clean["polity"]==1)]
high = mydata_clean[(mydata_clean["polity"]==2)]

print("association between income level and electric consumption for low democracy countries")
print(scipy.stats.pearsonr(low["incomeperperson"], low["relectricperperson"]))

print("association between income level and electric consumption for high democracy countries")
print(scipy.stats.pearsonr(high["incomeperperson"], high["relectricperperson"]))

My results show a positive correlation in both categories, with the one in high democracy countries being stronger (r=0.85 versus r=0.64). Both p-values are far below 0.05, so both correlations are statistically significant.

association between income level and electric consumption for low democracy countries
(0.6448264326053613, 5.105940761759414e-05)
association between income level and electric consumption for high democracy countries
(0.8466751098198433, 7.709054663738328e-26)
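The logic of the moderation check (split the sample by the third variable, then compute the correlation within each subgroup) can be sketched without any libraries. The rows below are made up, and pearson_r is a hand-written helper for illustration rather than the scipy function used in my code.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient computed from its definition."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical rows: (polityscore, incomeperperson, relectricperperson)
rows = [
    (-7, 800.0, 150.0), (-3, 2500.0, 300.0), (0, 6000.0, 900.0),
    (4, 1200.0, 200.0), (8, 15000.0, 2600.0), (10, 30000.0, 5200.0),
]

# Same split as above: scores up to 0 are "low", the rest "high"
low = [r for r in rows if r[0] <= 0]
high = [r for r in rows if r[0] > 0]

r_low = pearson_r([r[1] for r in low], [r[2] for r in low])
r_high = pearson_r([r[1] for r in high], [r[2] for r in high])
print("low democracy:", r_low)
print("high democracy:", r_high)
```

If the two subgroup correlations differ a lot in strength or direction, the third variable may be moderating the association.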

This is also visible on the scatterplot:

pustolovka-blog · 5 years

Pearson correlation for the Gapminder dataset

Continuing the work on the gapminder dataset, I was interested in understanding whether there was a correlation between electric consumption and income per person, as well as electric consumption and employment rate. For this purpose I conducted the Pearson correlation test:

mydata_clean = mydata.dropna()

print("association between electric consumption and income")
print(scipy.stats.pearsonr(mydata_clean["incomeperperson"], mydata_clean["relectricperperson"]))

print("association between electric consumption and employment")
print(scipy.stats.pearsonr(mydata_clean["employrate"], mydata_clean["relectricperperson"]))

In the output I could see a positive and strong correlation between income levels and electric consumption, and a weak correlation between employment rate and electric consumption.

association between electric consumption and income
(0.6536076842568537, 4.5973586072831085e-17)
association between electric consumption and employment
(0.1437964201546844, 0.10399697594348797)

In conclusion, higher income is associated with greater electricity consumption. The r coefficient is 0.65, and the p-value is extremely low (4.5973586072831085e-17). If we square the r coefficient, we get about 0.43, meaning that income accounts for roughly 43% of the variability in electric consumption.
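The squaring step is a one-liner; here it is with the r value from the output above, just to make the arithmetic explicit.

```python
# Pearson r between income and electric consumption, from the output above
r = 0.6536076842568537

# r squared: the share of variability in one variable explained by the other
r_squared = r ** 2
print(round(r_squared, 3))
```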

pustolovka-blog · 5 years

Chi-Square test of independence

Working with the Gapminder dataset, for the purpose of this assignment I worked with previously created categories of income (poverty, low, medium, high) and electric consumption (very low, low, medium, high).

As the course instructions imply, running a 4x4 chi-square test is usually not done in practice; I did it here for the purpose of the assignment.

In the initial chi-square test, the very low p value of 0.0015 indicated that there was reason to reject the null hypothesis (H0) that there was no connection between income and electric consumption. In other words, it indicated that there was a connection between the two variables.

In order to understand which of the categories of electric consumption and income were connected, I conducted the post hoc Bonferroni test on each of the explanatory variables (electric consumption).

As a result, I could confirm that there was significant connection between the category of poverty and very low electric consumption, as well as middle and high income and high electric consumption.
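Two pieces of the procedure above are easy to sketch by hand: the chi-square statistic itself, and the Bonferroni-adjusted significance level for pairwise post hoc comparisons. The 2x2 table is invented for illustration, and chi_square is a hypothetical helper, not the scipy function used in the course.

```python
from itertools import combinations

def chi_square(observed):
    """Chi-square statistic for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (obs - expected) ** 2 / expected
    return stat

# Hypothetical 2x2 table of counts (one pairwise comparison)
observed = [[10, 20], [30, 40]]
stat = chi_square(observed)

# Bonferroni adjustment: divide alpha by the number of pairwise comparisons
categories = ["poverty", "low", "medium", "high"]
pairs = list(combinations(categories, 2))
adjusted_alpha = 0.05 / len(pairs)

print("chi-square:", round(stat, 4))
print("comparisons:", len(pairs), "adjusted alpha:", round(adjusted_alpha, 4))
```

With four categories there are six pairwise comparisons, so each post hoc p value has to fall below roughly 0.008 rather than 0.05 to count as significant.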

pustolovka-blog · 5 years

Analysis of variance on the Gapminder dataset

Continuing the work on the Gapminder dataset, I was interested in understanding whether electric consumption was higher in countries with higher income. During the previous course, I had grouped countries by income in 4 categories (poverty, low income, middle income and high income countries) which I used as the categorical variable.

My null hypothesis was therefore that there is no difference between the mean values of electric consumption in each group. The alternate hypothesis was that the means differ.

Using the OLS method, I got a clear result. The p value was 1.46e-14 (0.0000000000000146), far below 0.05, which meant that I could reject the null hypothesis in favor of the alternate hypothesis. This meant that there was an association between income and electric consumption.
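The F statistic behind an ANOVA compares the variability between group means with the variability within groups. A minimal hand computation, on three tiny made-up groups, looks like this; one_way_f is a hypothetical helper written for illustration, whereas the assignment used the OLS output of a statistics package.

```python
def one_way_f(groups):
    """F statistic for a one-way analysis of variance."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    # Variability of the group means around the grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Variability of the values around their own group mean
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical electric consumption values for three income groups
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat = one_way_f(groups)
print(f_stat)
```

A large F value means the group means differ more than would be expected from the spread inside the groups, which is what produces a small p value.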

As I had more than 2 categories, I also did a post hoc test to determine in which of the categories there was statistical significance.

As the results show, the null hypothesis can be rejected for the comparisons of poverty with low, middle and high income, as well as for low and middle income. It cannot, however, be rejected for the comparison of low income and high income, or of middle income and high income.

pustolovka-blog · 5 years

Visualizing data

For assignment 4, I have created univariate graphs, and bivariate graphs for the Gapminder variables of income per person, electric consumption rate, and employment rate that I selected in the beginning.

The univariate graph for income per person shows a right-skewed distribution, indicating that most countries have very low income and that few have medium, high or very high income per person. Furthermore, we see a unimodal distribution where the majority of countries are in the category with the lowest income.

For the variable of residential electric consumption per person we see a similar tendency as with income. Most countries have low electric consumption, and the graph is skewed right, with the number of countries decreasing as electric consumption increases.

With employment rate variable, the distribution is somewhat different. We see a bimodal distribution, with the majority of values centered around the lowest and somewhat high employment rate.

When it comes to my research question, I was interested to know whether countries with greater income and greater employment rate, also have greater electrical consumption habits. Changing my variables into quantitative, I did a scatterplot for my bivariate graphs.

Using the variables of income and electric consumption, I can see that there is a positive correlation, indicating that countries with lower income use less electricity, whereas countries with higher income use more electricity. This supports my hypothesis.

For the correlation of employment rate and electric consumption I cannot see any tendencies. Employment rate is distributed somewhat equally and does not correlate with electric consumption habits.

pustolovka-blog · 6 years

Managing data in the Gapminder dataset

As it turns out, struggling to finish the assignment in week 2 resulted in me doing the work for week 3 in advance. Anyhow, given that I had already applied some of the data management options, I used this week to refine the details.

My primary decision regarding data management was grouping or binning, as it best suited the dataset. Firstly, I selected my own range for binning each of the variables - incomeperperson, electricalconsumption and employment rate. Doing this gave me more credible data when using the value_counts function.

income binning:

0-5000

5001-10000

10001-15000

15001-20000

electric consumption binning

0-50

51-100

101-500

501-1000

employment binning

20-40

40-60

60-80

80-100

Based on these categories/bins I managed to do frequency distribution and extrapolate count and percentage of these categories.

In conclusion, my data shows that income inequality globally is rather high. More than half of the world’s countries (115, meaning 61%) fall in the lowest income category, while only 7 (a proportion of 0.03, or about 3%) fall in the highest income category. When it comes to electric consumption, there are more countries in the highest consumption categories than in the lower ones. In total, 33% of the countries are in the upper electricity consumption groups. Here, it is important to note that more than half of that data is missing (55% are nan). Finally, regarding employment, there are no countries in the lowest employment category, and 52% of the countries are in the middle category.

pustolovka-blog · 6 years

Analyzing the Gapminder dataset in Python

Working with the Gapminder dataset imposed several challenges that the course did not provide solutions to. Given that this dataset does not give specific values that can be counted, I had to dig deeper and find a suitable solution.

After struggling a lot with identifying quantifiable variables in this dataset, I found the solution in grouping certain values (income, electric consumption and employment rate) into 4 categories. I used the binning option, or as it is defined in the program, the pandas.cut function, to create meaningful categories I could work with. Below, I explain the way I ran my program and the results I got.

I started with the regular functions as explained in the course.

Then I moved on to creating categories, or labels as written in the program. Using the coerce option (errors='coerce') when converting the variables to numeric, I turned unanswered fields into missing values, and with the pandas.cut function I created 4 labels for each of the variables I looked into.

Only after doing this could I use the value_counts function for doing frequency distribution with my data. Following the course instructions, and adding the bins=4 option, I successfully got both the count and percentage for each of my variables.
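The steps above (convert to numeric, bin, count) can be sketched as follows. The values, bin edges and labels are made up for illustration and are not the ones from my assignment; explicit bin edges are used here instead of bins=4 so that the categories are easy to read.

```python
import pandas as pd

# Hypothetical raw values; the blank string stands for an unanswered field
income = pd.Series(["1200", "4800", "7300", " ", "12000", "19000", "300"])

# Convert to numbers; anything that cannot be parsed becomes NaN
income = pd.to_numeric(income, errors="coerce")

# Group the values into four labelled bins (edges and labels are assumptions)
bins = [0, 5000, 10000, 15000, 20000]
labels = ["0-5000", "5001-10000", "10001-15000", "15001-20000"]
income_group = pd.cut(income, bins=bins, labels=labels)

# Frequency distribution: counts and proportions per category
counts = income_group.value_counts(sort=False)
percentages = income_group.value_counts(sort=False, normalize=True)

print(counts)
print(percentages)
```

Missing values are left out of the counts by default, so the proportions refer to countries with data only.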

RESULTS:

1. Results for the variable incomeperperson when using pandas.cut function.

2. Results for the variable electric consumption per person also using pandas.cut function

3. Results for the employment rate variable using pandas.cut function.

Finally, once I managed to separate these variables and categories, I did the frequency distribution and got following results.

What my results currently show is that the majority of the countries in the dataset fall below the median: 169+18, meaning 79% of countries in the category poverty and 8% in the category low income. The same applies to electric consumption and employment. As this does not reflect reality, I will adapt the parameters as the course advances.

As I would still like to keep my research question open, I did not select rows, as that would eliminate certain countries. I will definitely look into removing certain countries from the analyses at a later point.

pustolovka-blog · 6 years

Does greater income result in higher residential electricity consumption?

Both my educational background and my current job are intrinsically linked to sustainable development, the challenges we face and possibilities we have to improve the societies we live in. For these reasons, I have chosen to look at the Gapminder dataset.

I am interested in understanding people’s consumption habits, especially their energy consumption. Furthermore, I am interested in knowing whether higher income per person leads to greater energy consumption. It is my hypothesis that people in developed countries consume more energy which leads to greater extraction of resources and environmental damage. Therefore, based on the available data in the data set, I will look at the variables of income per person and residential electricity consumption and see whether there is a correlation. It is my hypothesis that countries with higher income per person ratio will have greater residential electricity consumption per person.

Research shows that development and growth are directly linked to greater energy consumption (Brown et al. 2011, Yalcintas and Kaya 2017). Furthermore, there are critical questions of whether current trends of population and economic growth can be sustained by extracting the resources necessary to provide energy to modern day societies (Brown et al. 2011). And while energy and energy consumption can relate to many different areas, most energy, whether deriving from traditional or renewable sources, is transformed into electricity (Liu et al. 2016). Providing the energy necessary for different household appliances, electricity consumption is directly linked to the use of refrigerators, air conditioning devices, stoves, ovens and many other devices in countries around the world (Bouznit et al. 2018, Yalcintas and Kaya 2017, Liu et al. 2016). Based on this literature, there is a clear correlation between countries’ development, represented in the income per person variable, and residential energy consumption.

In this light, using the two variables (income per person and residential electricity consumption), I wish to understand the electricity consumption practices of Serbian citizens in relation to other, more and less developed countries in the data set.
