Assignment 4 Data Analysis and Visualisation
Code :-
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Mon Jun 25 08:54:30 2018

@author: oonagh.obrien
"""
# Import libraries for doing data manipulation and statistical functions
import pandas
import numpy
import seaborn
import matplotlib.pyplot as plt

# Read in outlook on life dataset 2012 to data variable
# Check how many rows and columns in data
data = pandas.read_csv('ool_pds.csv', low_memory=0)

# Set PANDAS to show all columns in DataFrame
pandas.set_option('display.max_columns', None)
# Set PANDAS to show all rows in DataFrame
pandas.set_option('display.max_rows', None)

# Create new view of data frame including PPNET (has internet access),
# W1_A11 (number of times per week national news watched on tv or internet,
# the categorical explanatory variable) and W1_D9.
# Reference this view of data with sub1
sub1 = data[['W1_A11', 'PPNET', 'W1_D9']]
#print('Description of 3 variables \n')
#print(sub1.describe())

# Make a copy of my new subsetted data
sub2 = sub1.copy()

# Convert data to numeric
sub2['W1_A11'] = pandas.to_numeric(sub2['W1_A11'])
sub2['PPNET'] = pandas.to_numeric(sub2['PPNET'])
sub2['W1_D9'] = pandas.to_numeric(sub2['W1_D9'])

#print('\nFrequency table for original W1_A11 - how many times watch National News')
c1 = sub2['W1_A11'].value_counts(sort=False, dropna=False)
#print(c1)

# Recode missing values (-1 refused) to python missing (NaN)
sub2['W1_A11'] = sub2['W1_A11'].replace(-1, numpy.nan)
#print('\nFrequency table for W1_A11 - how many times watch National News')
#print('after recoding missing values to NaN')
c1 = sub2['W1_A11'].value_counts(sort=False, dropna=False)
#print(c1)

# Recode values for W1_A11 into a new variable, TIMES_WATCH_NAT_NEWS (0..7)
recode1 = {1: 0, 2: 1, 3: 2, 4: 3, 5: 4, 6: 5, 7: 6, 8: 7}
sub2['TIMES_WATCH_NAT_NEWS'] = sub2['W1_A11'].map(recode1)

# Univariate bar graph for categorical variables
# First change format from numeric to categorical
sub2["TIMES_WATCH_NAT_NEWS"] = sub2["TIMES_WATCH_NAT_NEWS"].astype('category')

seaborn.countplot(x="TIMES_WATCH_NAT_NEWS", data=sub2)
plt.xlabel('Number of times watch national news per week')
plt.ylabel('Number of responses')
plt.title('Univariate Graph of Number of times watch national news per week')
fig = plt.gcf()
fig.savefig('TIMES_WATCH_NAT_NEWS')
plt.show()

# Recode values for W1_A11 into a new variable, WATCH_NAT_NEWS -
# either does (1) or does not (0) watch news
recode1 = {1: 0, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1}
sub2['WATCH_NAT_NEWS'] = sub2['W1_A11'].map(recode1)
#print('\nFrequency table for new recoded variable WATCH_NAT_NEWS')
#print('- how many times watch National News')
c1 = sub2['WATCH_NAT_NEWS'].value_counts(sort=False, dropna=False)
#print(c1)
sub2['HAS_INTERNET'] = sub2['PPNET']
#print('\nFrequency table for new variable HAS_INTERNET')
#print('- has internet or not')
c1 = sub2['HAS_INTERNET'].value_counts(sort=False, dropna=False)
#print(c1)

print('\nFrequency table for original variable PPNET')
print('- has internet or not')
print('counts for original PPNET')
c1 = sub2['PPNET'].value_counts(sort=False, dropna=False)
print(c1)
# No missing values - boolean value true or false provided for all

# Univariate bar graph for categorical variables
# First change format from numeric to categorical
sub2["PPNET"] = sub2["PPNET"].astype('category')

seaborn.countplot(x="PPNET", data=sub2)
plt.xlabel('Has internet or not - 0 indicates no, 1 indicates yes')
plt.ylabel('Number of responses')
plt.title('Univariate Graph of Internet access or not')
fig = plt.gcf()
fig.savefig('HASINTERNET')
plt.show()

#print('\nFrequency table for W1_D9 - Hillary Clinton popularity')
#print('counts for original W1_D9')
c1 = sub2['W1_D9'].value_counts(sort=False, dropna=False)
#print(c1)

# Recode missing values (-1 refused, 998 don't recognise) to python missing (NaN)
sub2['W1_D9'] = sub2['W1_D9'].replace(-1, numpy.nan)
sub2['W1_D9'] = sub2['W1_D9'].replace(998, numpy.nan)

#print('\nFrequency table for variable W1_D9')
#print('- Hillary Clinton popularity after missing values -1 and 998 converted to NaN')
c1 = sub2['W1_D9'].value_counts(sort=False, dropna=False)
#print(c1)

# Univariate histogram for quantitative variable
seaborn.distplot(sub2["W1_D9"].dropna(), kde=False)
plt.xlabel('Popularity of Hillary Clinton')
plt.ylabel('Number of responses')
plt.title('Univariate Graph of Popularity of Hillary Clinton')
fig = plt.gcf()
fig.savefig('HILLARYPOPULARITY')
plt.show()

# Bivariate bar graph C->Q: times watch national news explanatory,
# Hillary popularity response
seaborn.factorplot(x="TIMES_WATCH_NAT_NEWS", y="W1_D9", data=sub2, kind="bar", ci=None)
plt.title('Bivariate graph of Hillary Popularity as response to Times watch National News from life dataset 2012 study')
plt.xlabel('Times watch national news')
plt.ylabel('Hillary Popularity')
fig = plt.gcf()
fig.savefig('HILLARYTIMES')
plt.show()

# Bivariate bar graph C->Q: whether watch national news explanatory,
# Hillary popularity response
seaborn.factorplot(x="WATCH_NAT_NEWS", y="W1_D9", data=sub2, kind="bar", ci=None)
plt.title('Bivariate graph of Hillary Popularity as response to whether watch national news from life dataset 2012 study')
plt.xlabel('Watch national news - 0 indicates no, 1 indicates yes')
plt.ylabel('Hillary Popularity')
fig = plt.gcf()
fig.savefig('HILLARYNEWS')
plt.show()

# Bivariate bar graph C->Q: internet access explanatory,
# Hillary popularity response
seaborn.factorplot(x="HAS_INTERNET", y="W1_D9", data=sub2, kind="bar", ci=None)
plt.title('Bivariate graph of Hillary Popularity as response to Internet Access from life dataset 2012 study')
plt.xlabel('Has Internet - 0 indicates no, 1 indicates yes')
plt.ylabel('Hillary Popularity')
fig = plt.gcf()
fig.savefig('HILLARYINTERNET')
plt.show()
The univariate graph of the number of times national news is watched per week:
This is a unimodal graph, with its highest peak at watching national news 0 times per week; the second highest peak is at 7 times per week. It appears to be skewed to the right, as there are higher frequencies in the lower categories than in the higher ones, except for the final category (7 times per week), which is the second highest.
The univariate graph of the categorical variable internet access.
This graph is unimodal, with its highest peak at having internet access: 1,787 of the 2,294 respondents have internet access.
The univariate graph of the quantitative variable Hillary Clinton popularity:
This graph is unimodal, with its highest peak at approximately 85% popularity, with around 460 respondents. The graph appears to be skewed to the left, as responses are concentrated at the higher popularity values.
Bivariate Graph of Number of times watch National News with Hillary Clinton Popularity
The graph above plots the number of times national news is watched per week against the level of Hillary Clinton's popularity. The bar chart does not show a clear relationship or trend between the two variables, although there is a slight tendency for popularity to increase as the number of times watching the news increases.
Assignment 4
Code :-
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Fri Jul 20 07:53:09 2018
@author: oonagh.obrien
"""
# -*- coding: utf-8 -*-
"""
Created on Mon Sep 21 10:18:43 2015
@author: jml
"""
# ANOVA
import numpy
import pandas
import statsmodels.formula.api as smf
import seaborn
import matplotlib.pyplot as plt
# Read in outlook on life dataset 2012 to data variable
# Check how many rows and columns in data
data = pandas.read_csv('ool_pds.csv',low_memory=0)
#Set PANDAS to show all columns in DataFrame
pandas.set_option('display.max_columns', None)
#Set PANDAS to show all rows in DataFrame
pandas.set_option('display.max_rows', None)
# create new view of data frame with nulls removed
# include
# W1_D9 Hillary Clinton Popularity
# W1_C1 - Republican (1), Democrat (2), Independent (3), Something Else (4), Refused (-1)
# W1_B4 - Extremely Angry (1), Very Angry (2), Somewhat Angry (3),
# A little Angry (4), Not Angry at all (5), Refused (-1)
sub1 = data[['W1_C1','W1_B4','W1_D9']]
#make a copy of my new subsetted data
sub2 = sub1.copy()
# Convert data to numeric
sub2['W1_C1']= pandas.to_numeric(sub2['W1_C1'])
sub2['W1_B4']= pandas.to_numeric(sub2['W1_B4'])
sub2['W1_D9']= pandas.to_numeric(sub2['W1_D9'])
#print ('\nFrequency table for original W1_C1 - Political allegiance')
sub2['W1_C1']=sub2['W1_C1'].replace(-1, numpy.nan)
c1 = sub2['W1_C1'].value_counts(sort=False, dropna=False)
#print(c1)
# W1_D9 - Hillary Clinton Popularity
# recode missing values -1 refused or 998 don't recognise to python missing (NaN)
sub2['W1_D9']=sub2['W1_D9'].replace(-1, numpy.nan)
sub2['W1_D9']=sub2['W1_D9'].replace(998, numpy.nan)
#print ('\nFrequency table for variable W1_D9')
#print('- Hillary Clinton popularity after missing values -1 and 998 converted to Nan')
#print ('\nFrequency table for original W1_D9 - Hillary Clinton Popularity')
c1 = sub2['W1_D9'].value_counts(sort=False, dropna=False)
#print(c1)
#print ('\nFrequency table for original W1_B4 - Level of anger')
# Recode so that a higher number means more anger
# (original coding ran from 1 = Extremely Angry to 5 = Not Angry at all)
recode1 = {1: 5, 2: 4, 3: 3, 4: 2, 5: 1}
sub2['W1_B4']= sub2['W1_B4'].map(recode1)
sub2['W1_B4']=sub2['W1_B4'].replace(-1, numpy.nan)
c1 = sub2['W1_B4'].value_counts(sort=False, dropna=False)
#print(c1)
#Anova for popularity and political allegiance
model1 = smf.ols(formula='W1_D9 ~ C(W1_C1)', data=sub2).fit()
print (model1.summary())
#Anova for popularity and political allegiance
model1 = smf.ols(formula='W1_D9 ~ C(W1_B4)', data=sub2).fit()
#print (model1.summary())
groupdata1 = sub2[['W1_D9','W1_B4']].dropna()
groupdata2 = sub2[['W1_D9','W1_C1']].dropna()
#Output means for level of anger
print ('Means for level of anger')
m1= groupdata1.groupby('W1_B4').mean()
#print (m1)
#Output std dev for level of anger
print ('Std Dev for level of anger')
m1= groupdata1.groupby('W1_B4').std()
#print (m1)
#Output means for political affiliation
print ('Means for political affiliation')
m1= groupdata2.groupby('W1_C1').mean()
print (m1)
#Output std dev for political affiliation
print ('Std Dev for political affiliation')
m1= groupdata2.groupby('W1_C1').std()
print (m1)
#seaborn.factorplot(x="W1_B4", y="W1_D9", data=groupdata1, kind="bar", ci=None)
#plt.xlabel('Level of Anger')
#plt.ylabel('Hillary popularity')
#plt.title('Level of Anger v Hillary C popularity')
seaborn.factorplot(x="W1_C1", y="W1_D9", data=groupdata2, kind="bar", ci=None)
plt.xlabel('Political Affiliation')
plt.ylabel('Hillary popularity')
plt.title('Political Affiliation v Hillary C popularity')
plt.show()
# Consider moderator of level of anger - subgroup data depending on
# level of anger
# Not angry at all
grp1=sub2[(sub2['W1_B4']==1)]
# A little angry
grp2=sub2[(sub2['W1_B4']==2)]
# Somewhat angry
grp3=sub2[(sub2['W1_B4']==3)]
# Very Angry
grp4=sub2[(sub2['W1_B4']==4)]
# Extremely Angry
grp5=sub2[(sub2['W1_B4']==5)]
print ('association between Political affiliation and HC popularity for Not angry')
model1 = smf.ols(formula='W1_D9 ~ C(W1_C1)', data=grp1).fit()
print (model1.summary())
print ('association between Political affiliation and HC popularity for A little angry')
model2 = smf.ols(formula='W1_D9 ~ C(W1_C1)', data=grp2).fit()
print (model2.summary())
print ('association between Political affiliation and HC popularity for Somewhat angry')
model3 = smf.ols(formula='W1_D9 ~ C(W1_C1)', data=grp3).fit()
print (model3.summary())
print ('association between Political affiliation and HC popularity for Very angry')
model4 = smf.ols(formula='W1_D9 ~ C(W1_C1)', data=grp4).fit()
print (model4.summary())
print ('association between Political affiliation and HC popularity for Extremely angry')
model5 = smf.ols(formula='W1_D9 ~ C(W1_C1)', data=grp5).fit()
print (model5.summary())
print ("means for Popularity by Political affiliation Not angry")
m1= grp1.groupby('W1_C1').mean()
print (m1)
print ("means for Popularity by Political affiliation A little angry")
m2= grp2.groupby('W1_C1').mean()
print (m2)
print ("means for Popularity by Political affiliation Somewhat angry")
m3= grp3.groupby('W1_C1').mean()
print (m3)
print ("means for Popularity by Political affiliation Very angry")
m4= grp4.groupby('W1_C1').mean()
print (m4)
print ("means for Popularity by Political affiliation Extremely angry")
m5= grp5.groupby('W1_C1').mean()
print (m5)
print()
print()
seaborn.factorplot(x="W1_C1", y="W1_D9", data=grp1, kind="bar", ci=None)
plt.xlabel('Political Affiliation')
plt.ylabel('Hillary popularity')
plt.title('Political Affiliation v Hillary C popularity not angry')
seaborn.factorplot(x="W1_C1", y="W1_D9", data=grp2, kind="bar", ci=None)
plt.xlabel('Political Affiliation')
plt.ylabel('Hillary popularity')
plt.title('Political Affiliation v Hillary C popularity a little angry')
seaborn.factorplot(x="W1_C1", y="W1_D9", data=grp3, kind="bar", ci=None)
plt.xlabel('Political Affiliation')
plt.ylabel('Hillary popularity')
plt.title('Political Affiliation v Hillary C popularity somewhat angry')
seaborn.factorplot(x="W1_C1", y="W1_D9", data=grp4, kind="bar", ci=None)
plt.xlabel('Political Affiliation')
plt.ylabel('Hillary popularity')
plt.title('Political Affiliation v Hillary C popularity very angry')
seaborn.factorplot(x="W1_C1", y="W1_D9", data=grp5, kind="bar", ci=None)
plt.xlabel('Political Affiliation')
plt.ylabel('Hillary popularity')
plt.title('Political Affiliation v Hillary C popularity extremely angry')
---------------------------------
Output :-
OLS Regression Results
==============================================================================
Dep. Variable: W1_D9 R-squared: 0.271
Model: OLS Adj. R-squared: 0.270
Method: Least Squares F-statistic: 266.3
Date: Sat, 21 Jul 2018 Prob (F-statistic): 6.24e-147
Time: 10:33:39 Log-Likelihood: -9906.3
No. Observations: 2154 AIC: 1.982e+04
Df Residuals: 2150 BIC: 1.984e+04
Df Model: 3
Covariance Type: nonrobust
===================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------
Intercept 39.2125 1.346 29.140 0.000 36.574 41.851
C(W1_C1)[T.2.0] 39.3153 1.514 25.976 0.000 36.347 42.283
C(W1_C1)[T.3.0] 19.8492 1.701 11.668 0.000 16.513 23.185
C(W1_C1)[T.4.0] 12.3853 2.848 4.349 0.000 6.801 17.970
==============================================================================
Omnibus: 152.720 Durbin-Watson: 1.964
Prob(Omnibus): 0.000 Jarque-Bera (JB): 186.379
Skew: -0.677 Prob(JB): 3.37e-41
Kurtosis: 3.495 Cond. No. 7.39
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
association between Political affiliation and HC popularity for Not angry
OLS Regression Results
==============================================================================
Dep. Variable: W1_D9 R-squared: 0.104
Model: OLS Adj. R-squared: 0.094
Method: Least Squares F-statistic: 10.15
Date: Sat, 21 Jul 2018 Prob (F-statistic): 2.38e-06
Time: 10:33:39 Log-Likelihood: -1212.6
No. Observations: 267 AIC: 2433.
Df Residuals: 263 BIC: 2448.
Df Model: 3
Covariance Type: nonrobust
===================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------
Intercept 55.3125 5.719 9.672 0.000 44.052 66.573
C(W1_C1)[T.2.0] 24.3249 5.981 4.067 0.000 12.549 36.101
C(W1_C1)[T.3.0] 11.7036 6.415 1.825 0.069 -0.927 24.334
C(W1_C1)[T.4.0] 8.4097 7.860 1.070 0.286 -7.067 23.886
==============================================================================
Omnibus: 47.995 Durbin-Watson: 1.996
Prob(Omnibus): 0.000 Jarque-Bera (JB): 72.160
Skew: -1.074 Prob(JB): 2.14e-16
Kurtosis: 4.368 Cond. No. 10.5
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
association between Political affiliation and HC popularity for A little angry
OLS Regression Results
==============================================================================
Dep. Variable: W1_D9 R-squared: 0.191
Model: OLS Adj. R-squared: 0.187
Method: Least Squares F-statistic: 41.70
Date: Sat, 21 Jul 2018 Prob (F-statistic): 3.36e-24
Time: 10:33:39 Log-Likelihood: -2386.0
No. Observations: 533 AIC: 4780.
Df Residuals: 529 BIC: 4797.
Df Model: 3
Covariance Type: nonrobust
===================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------
Intercept 44.8333 3.296 13.603 0.000 38.359 51.308
C(W1_C1)[T.2.0] 33.5203 3.482 9.628 0.000 26.681 40.360
C(W1_C1)[T.3.0] 17.3363 3.865 4.486 0.000 9.744 24.928
C(W1_C1)[T.4.0] 21.0490 6.140 3.428 0.001 8.988 33.110
==============================================================================
Omnibus: 63.753 Durbin-Watson: 2.073
Prob(Omnibus): 0.000 Jarque-Bera (JB): 87.388
Skew: -0.864 Prob(JB): 1.06e-19
Kurtosis: 3.974 Cond. No. 10.0
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
association between Political affiliation and HC popularity for Somewhat angry
OLS Regression Results
==============================================================================
Dep. Variable: W1_D9 R-squared: 0.282
Model: OLS Adj. R-squared: 0.278
Method: Least Squares F-statistic: 89.63
Date: Sat, 21 Jul 2018 Prob (F-statistic): 6.07e-49
Time: 10:33:39 Log-Likelihood: -3099.6
No. Observations: 690 AIC: 6207.
Df Residuals: 686 BIC: 6225.
Df Model: 3
Covariance Type: nonrobust
===================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------
Intercept 40.5978 2.260 17.965 0.000 36.161 45.035
C(W1_C1)[T.2.0] 37.8421 2.497 15.154 0.000 32.939 42.745
C(W1_C1)[T.3.0] 20.0949 2.817 7.133 0.000 14.564 25.626
C(W1_C1)[T.4.0] 13.1522 5.871 2.240 0.025 1.625 24.680
==============================================================================
Omnibus: 84.449 Durbin-Watson: 1.932
Prob(Omnibus): 0.000 Jarque-Bera (JB): 126.149
Skew: -0.843 Prob(JB): 4.05e-28
Kurtosis: 4.244 Cond. No. 9.06
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
association between Political affiliation and HC popularity for Very angry
OLS Regression Results
==============================================================================
Dep. Variable: W1_D9 R-squared: 0.231
Model: OLS Adj. R-squared: 0.226
Method: Least Squares F-statistic: 41.42
Date: Sat, 21 Jul 2018 Prob (F-statistic): 2.04e-23
Time: 10:33:39 Log-Likelihood: -1951.3
No. Observations: 417 AIC: 3911.
Df Residuals: 413 BIC: 3927.
Df Model: 3
Covariance Type: nonrobust
===================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------
Intercept 41.8913 2.730 15.343 0.000 36.524 47.258
C(W1_C1)[T.2.0] 35.3237 3.338 10.582 0.000 28.762 41.885
C(W1_C1)[T.3.0] 15.4104 3.656 4.215 0.000 8.224 22.597
C(W1_C1)[T.4.0] 9.7174 6.105 1.592 0.112 -2.284 21.719
==============================================================================
Omnibus: 28.366 Durbin-Watson: 1.755
Prob(Omnibus): 0.000 Jarque-Bera (JB): 32.526
Skew: -0.680 Prob(JB): 8.65e-08
Kurtosis: 3.147 Cond. No. 6.03
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
association between Political affiliation and HC popularity for Extremely angry
OLS Regression Results
==============================================================================
Dep. Variable: W1_D9 R-squared: 0.343
Model: OLS Adj. R-squared: 0.334
Method: Least Squares F-statistic: 40.32
Date: Sat, 21 Jul 2018 Prob (F-statistic): 5.21e-21
Time: 10:33:39 Log-Likelihood: -1131.3
No. Observations: 236 AIC: 2271.
Df Residuals: 232 BIC: 2284.
Df Model: 3
Covariance Type: nonrobust
===================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------
Intercept 28.0800 3.402 8.253 0.000 21.377 34.783
C(W1_C1)[T.2.0] 51.2836 4.973 10.313 0.000 41.486 61.081
C(W1_C1)[T.3.0] 19.1668 4.780 4.010 0.000 9.749 28.585
C(W1_C1)[T.4.0] -4.0244 7.733 -0.520 0.603 -19.261 11.212
==============================================================================
Omnibus: 10.111 Durbin-Watson: 2.182
Prob(Omnibus): 0.006 Jarque-Bera (JB): 5.862
Skew: 0.205 Prob(JB): 0.0533
Kurtosis: 2.346 Cond. No. 4.91
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
means for Popularity by Political affiliation Not angry
W1_B4 W1_D9
W1_C1
1.0 1.0 55.312500
2.0 1.0 79.637427
3.0 1.0 67.016129
4.0 1.0 63.722222
means for Popularity by Political affiliation A little angry
W1_B4 W1_D9
W1_C1
1.0 2.0 44.833333
2.0 2.0 78.353591
3.0 2.0 62.169643
4.0 2.0 65.882353
means for Popularity by Political affiliation Somewhat angry
W1_B4 W1_D9
W1_C1
1.0 3.0 40.597826
2.0 3.0 78.439904
3.0 3.0 60.692771
4.0 3.0 53.750000
means for Popularity by Political affiliation Very angry
W1_B4 W1_D9
W1_C1
1.0 4.0 41.891304
2.0 4.0 77.215054
3.0 4.0 57.301724
4.0 4.0 51.608696
means for Popularity by Political affiliation Extremely angry
W1_B4 W1_D9
W1_C1
1.0 5.0 28.080000
2.0 5.0 79.363636
3.0 5.0 47.246753
4.0 5.0 24.055556
Political Affiliation v Hillary Clinton Popularity for respondents who are extremely angry
Analysis :-
An ANOVA examining the association between Hillary Clinton's popularity rating and political affiliation produced a very small p-value and a significant F-statistic, which indicates a statistically significant relationship between the two variables.
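As a minimal, self-contained sketch (toy data, not the Outlook on Life dataset) of where that F-statistic and p-value can be read off a fitted statsmodels results object:

import pandas
import statsmodels.formula.api as smf

# Toy data: outcome y measured in two groups g
toy = pandas.DataFrame({'y': [10, 12, 11, 30, 31, 29],
                        'g': [1, 1, 1, 2, 2, 2]})
fit = smf.ols(formula='y ~ C(g)', data=toy).fit()
print(fit.fvalue)    # overall F-statistic for the ANOVA
print(fit.f_pvalue)  # its p-value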
Further examination of the means and standard deviations, together with a graph of the association, indicates that popularity is highest when political affiliation is Democrat and lowest when it is Republican.
The variable W1_B4, 'level of anger', was considered as a possible moderator of the association between 'political affiliation' and 'popularity of Hillary Clinton'. This was investigated by making a subgroup for each level of anger: not angry at all, a little angry, somewhat angry, very angry, and extremely angry.
An ANOVA was run within each subgroup on the relationship between 'political affiliation' and 'popularity of Hillary Clinton'. Every subgroup had a very small p-value and a large F-statistic, which indicates a statistically significant relationship between the two variables.
From examining the means and standard deviations and graphing the relationship between 'political affiliation' and 'popularity of Hillary Clinton' for each of the subgroups, it appears that 'level of anger' is a moderator of the association, with popularity decreasing among those who were not Democrats as the level of anger increased.
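The five nearly identical per-subgroup blocks in the code above can also be expressed compactly as a loop; a minimal sketch, assuming sub2 has been prepared as in the code above:

import statsmodels.formula.api as smf

# Fit the same OLS/ANOVA of popularity (W1_D9) on affiliation (W1_C1)
# within each level-of-anger subgroup (W1_B4 == 1..5)
anger_labels = {1: 'Not angry at all', 2: 'A little angry',
                3: 'Somewhat angry', 4: 'Very angry', 5: 'Extremely angry'}
for level, label in anger_labels.items():
    grp = sub2[sub2['W1_B4'] == level]
    fit = smf.ols(formula='W1_D9 ~ C(W1_C1)', data=grp).fit()
    print(label, 'F =', round(fit.fvalue, 2), 'p =', fit.f_pvalue)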
Generating Correlation Coefficient to determine correlation between Number of times Watch National News Per Week and Hillary Clinton’s popularity
------------------------------
CODE
-------------------------------
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Mon Jun 25 08:54:30 2018

@author: oonagh.obrien
"""
# Import libraries for doing data manipulation and statistical functions
import pandas
import numpy
import seaborn
import scipy.stats
import matplotlib.pyplot as plt

# Read in outlook on life dataset 2012 to data variable
# Check how many rows and columns in data
data = pandas.read_csv('ool_pds.csv', low_memory=0)

# Create new view of data frame including quantitative variable W1_A11
# (number of times per week national news watched on tv or internet)
# and quantitative variable W1_D9 (Hillary Clinton popularity)
sub1 = data[['W1_A11', 'W1_D9']]
print('Description of 2 variables \n')
print(sub1.describe())

# Make a copy of my new subsetted data
sub2 = sub1.copy()

# Convert data to numeric
sub2['W1_A11'] = pandas.to_numeric(sub2['W1_A11'])
sub2['W1_D9'] = pandas.to_numeric(sub2['W1_D9'])

print('\nFrequency table for original W1_A11 - how many times watch National News')
c1 = sub2['W1_A11'].value_counts(sort=False, dropna=False)
print(c1)

# Recode missing values (-1 refused) to python missing (NaN)
sub2['W1_A11'] = sub2['W1_A11'].replace(-1, numpy.nan)
print('\nFrequency table for W1_A11 - how many times watch National News')
print('after recoding missing values to NaN')

# Recode values for times watched from 0 to 7 rather than 1 to 8,
# as easier to interpret on graph
recode1 = {1: 0, 2: 1, 3: 2, 4: 3, 5: 4, 6: 5, 7: 6, 8: 7}
sub2['W1_A11_RECODE'] = sub2['W1_A11'].map(recode1)
c1 = sub2['W1_A11_RECODE'].value_counts(sort=False, dropna=False)
print(c1)

print('\nFrequency table for W1_D9')
print('- Hillary Clinton popularity')
print('counts for original W1_D9')
c1 = sub2['W1_D9'].value_counts(sort=False, dropna=False)
print(c1)

# Recode missing values (-1 refused, 998 don't recognise) to python missing (NaN)
sub2['W1_D9'] = sub2['W1_D9'].replace(-1, numpy.nan)
sub2['W1_D9'] = sub2['W1_D9'].replace(998, numpy.nan)

print('\nFrequency table for variable W1_D9')
print('- Hillary Clinton popularity after missing values -1 and 998 converted to NaN')
c1 = sub2['W1_D9'].value_counts(sort=False, dropna=False)
print(c1)
print('\n\n')

scat1 = seaborn.regplot(x="W1_A11_RECODE", y="W1_D9", fit_reg=True, data=sub2)
plt.xlabel('Times Watch National News')
plt.ylabel('Rating for Hillary Clinton popularity')
plt.title('Scatterplot for the Association Between \n Times per week watch National News and Hillary Clinton popularity')

data_clean = sub2.dropna()

print('Association Between Times per week watch National News and Hillary Clinton popularity')
print('Pearson co-efficient | p-value')
print(scipy.stats.pearsonr(data_clean['W1_A11_RECODE'], data_clean['W1_D9']))

result = scipy.stats.pearsonr(data_clean['W1_A11_RECODE'], data_clean['W1_D9'])
print('\nRSquared or Coefficient of Determination')
print(result[0] * result[0])
--------------------
OUTPUT TO EXAMINE CORRELATION
--------------------
Association Between Times per week watch National News and Hillary Clinton popularity
Pearson co-efficient | p-value
(0.12865812149374659, 1.73963015355937e-09)

RSquared or Coefficient of Determination
0.016552912226299656
------------------------
ASSESSMENT OF OUTPUT
--------------------------
The correlation coefficient assesses the degree of linear relationship between two variables. It ranges from -1 to +1. A correlation of +1 means that there is a perfect, positive, linear relationship between the two variables; a correlation of -1 means there is a perfect, negative, linear relationship between them.
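A toy illustration of the two extremes (made-up numbers, not the survey data):

import scipy.stats

x = [1, 2, 3, 4, 5]
print(scipy.stats.pearsonr(x, [2, 4, 6, 8, 10]))  # r = +1.0: perfect positive
print(scipy.stats.pearsonr(x, [10, 8, 6, 4, 2]))  # r = -1.0: perfect negative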
In my example I calculated the correlation coefficient to assess the degree of linear relationship between the number of times national news is watched each week (explanatory variable) and the popularity of Hillary Clinton (response variable).
The results were a Pearson coefficient of 0.12865812149374659 and a p-value of 1.73963015355937e-09.
The Pearson coefficient was very close to 0, which indicates a weak correlation between the number of times national news is watched each week and the popularity of Hillary Clinton; although the p-value was small, this matters little because the correlation was weak.
The R-squared, or coefficient of determination, was 0.016552912226299656, again very small, which indicates that less than 2% of the variability in Hillary Clinton's popularity can be explained by the number of times national news is watched per week.
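The R-squared figure is simply the square of the Pearson coefficient reported above:

# R-squared is the Pearson coefficient squared
r = 0.12865812149374659
print(r * r)  # 0.016552912226299656, i.e. about 1.7% of variability explained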
Data Management of variables on popularity, watching national news and internet access
-----------------
Code
------------------
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Mon Jun 25 08:54:30 2018

@author: oonagh.obrien
"""
# Import libraries for doing data manipulation and statistical functions
import pandas
import numpy

# Read in outlook on life dataset 2012 to data variable
# Check how many rows and columns in data
data = pandas.read_csv('ool_pds.csv', low_memory=0)
print('How many rows of data ')
print(len(data))
print('How many variables or columns of data ')
print(len(data.columns))

# Create new view of data frame including PPNET (has internet access),
# W1_A11 (number of times per week national news watched on tv or internet,
# the categorical explanatory variable) and W1_D9.
# Reference this view of data with sub1
sub1 = data[['W1_A11', 'PPNET', 'W1_D9']]
print('Description of 3 variables \n')
print(sub1.describe())

# Make a copy of my new subsetted data
sub2 = sub1.copy()

# Convert data to numeric
sub2['W1_A11'] = pandas.to_numeric(sub2['W1_A11'])
sub2['PPNET'] = pandas.to_numeric(sub2['PPNET'])
sub2['W1_D9'] = pandas.to_numeric(sub2['W1_D9'])

print('\nFrequency table for original W1_A11 - how many times watch National News')
c1 = sub2['W1_A11'].value_counts(sort=False, dropna=False)
print(c1)

# Recode missing values (-1 refused) to python missing (NaN)
sub2['W1_A11'] = sub2['W1_A11'].replace(-1, numpy.nan)
print('\nFrequency table for W1_A11 - how many times watch National News')
print('after recoding missing values to NaN')
c1 = sub2['W1_A11'].value_counts(sort=False, dropna=False)
print(c1)

# Recode values for W1_A11 into a new variable, WATCH_NAT_NEWS -
# either does or does not watch news
recode1 = {1: 0, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, -1: 0}
sub2['WATCH_NAT_NEWS'] = sub2['W1_A11'].map(recode1)
print('\nFrequency table for new recoded variable WATCH_NAT_NEWS')
print('- how many times watch National News')
c1 = sub2['WATCH_NAT_NEWS'].value_counts(sort=False, dropna=False)
print(c1)

sub2['HAS_INTERNET'] = sub2['PPNET']
print('\nFrequency table for new variable HAS_INTERNET')
print('- has internet or not')
c1 = sub2['HAS_INTERNET'].value_counts(sort=False, dropna=False)
print(c1)

print('\nFrequency table for original variable PPNET')
print('- has internet or not')
print('counts for original PPNET')
c1 = sub2['PPNET'].value_counts(sort=False, dropna=False)
print(c1)
# No missing values - boolean value true or false provided for all

print('\nFrequency table for W1_D9')
print('- Hillary Clinton popularity')
print('counts for original W1_D9')
c1 = sub2['W1_D9'].value_counts(sort=False, dropna=False)
print(c1)

# Recode missing values (-1 refused, 998 don't recognise) to python missing (NaN)
sub2['W1_D9'] = sub2['W1_D9'].replace(-1, numpy.nan)
sub2['W1_D9'] = sub2['W1_D9'].replace(998, numpy.nan)

print('\nFrequency table for variable W1_D9')
print('- Hillary Clinton popularity after missing values -1 and 998 converted to NaN')
c1 = sub2['W1_D9'].value_counts(sort=False, dropna=False)
print(c1)
print('\n\n')

# Quartile split (use qcut function & ask for 4 groups - gives you quartile split)
print('Popularity - 4 categories - quartiles')
sub2['POPULARITY4'] = pandas.qcut(sub2.W1_D9, 4,
        labels=["1=0-25 Percentile", "2=25-50 Percentile",
                "3=50-75 Percentile", "4=75-100 Percentile"])
c4 = sub2['POPULARITY4'].value_counts(sort=False, dropna=True)
print(c4)
print('\n\n')

# Crosstab evaluating which popularity values were put into which POPULARITY4 group
print(pandas.crosstab(sub2['POPULARITY4'], sub2['W1_D9']))
print('\n\n')

# Frequency distribution for POPULARITY4
print('counts for POPULARITY4')
c10 = sub2['POPULARITY4'].value_counts(sort=False)
print(c10)
print('\n\n')

print('percentages for POPULARITY4')
p10 = sub2['POPULARITY4'].value_counts(sort=False, normalize=True)
print(p10)
print('\n\n')
------------------
OUTPUT
______________
How many rows of data
2294
How many variables or columns of data
436
Description of 3 variables

            W1_A11        PPNET        W1_D9
count  2294.000000  2294.000000  2294.000000
mean      4.190061     0.778989    90.412380
std       2.625410     0.415019   154.079581
min      -1.000000     0.000000    -1.000000
25%       2.000000     1.000000    50.000000
50%       4.000000     1.000000    70.000000
75%       7.000000     1.000000    85.000000
max       8.000000     1.000000   998.000000

Frequency table for original W1_A11 - how many times watch National News
 2    264
 4    245
 6    242
 8    463
-1     11
 1    516
 3    284
 5    146
 7    123
Name: W1_A11, dtype: int64

Frequency table for W1_A11 - how many times watch National News
after recoding missing values to NaN
5.0    146
4.0    245
1.0    516
8.0    463
2.0    264
3.0    284
7.0    123
6.0    242
NaN     11
Name: W1_A11, dtype: int64

Frequency table for new recoded variable WATCH_NAT_NEWS
- how many times watch National News
1.0    1767
0.0     516
NaN      11
Name: WATCH_NAT_NEWS, dtype: int64

Frequency table for new variable HAS_INTERNET
- has internet or not
0     507
1    1787
Name: HAS_INTERNET, dtype: int64

Frequency table for original variable PPNET
- has internet or not
counts for original PPNET
0     507
1    1787
Name: PPNET, dtype: int64

Frequency table for W1_D9
- Hillary Clinton popularity
counts for original W1_D9
0      108
2        3
4        2
6        1
10       8
12       1
20       7
22       1
30      78
36       1
38       1
40      94
50     188
58       1
60     194
62       3
68       1
70     314
72       2
74       1
76       1
80      51
88       1
90      63
92       1
94       3
96       1
98       3
100    343
998     62
-1      51
3        2
5        7
7        1
15      94
25       7
29       1
35       5
37       1
45       4
55      11
57       1
59       2
65      21
69       3
75      48
79       2
85     456
87       3
89       2
91       1
93       1
95      27
97       1
99       4
Name: W1_D9, dtype: int64

Frequency table for variable W1_D9
- Hillary Clinton popularity after missing values -1 and 998 converted to NaN
0.0      108
80.0      51
100.0    343
60.0     194
50.0     188
40.0      94
30.0      78
NaN      113
15.0      94
5.0        7
59.0       2
2.0        3
72.0       2
10.0       8
92.0       1
76.0       1
25.0       7
4.0        2
3.0        2
20.0       7
36.0       1
7.0        1
62.0       3
6.0        1
38.0       1
88.0       1
58.0       1
96.0       1
68.0       1
29.0       1
12.0       1
22.0       1
65.0      21
95.0      27
89.0       2
99.0       4
97.0       1
70.0     314
85.0     456
90.0      63
55.0      11
87.0       3
45.0       4
37.0       1
57.0       1
94.0       3
74.0       1
35.0       5
98.0       3
75.0      48
69.0       3
79.0       2
93.0       1
91.0       1
Name: W1_D9, dtype: int64

Popularity - 4 categories - quartiles
1=0-25 Percentile      615
2=25-50 Percentile     551
3=50-75 Percentile     561
4=75-100 Percentile    454
Name: POPULARITY4, dtype: int64

W1_D9                0.0  2.0  3.0  ...  98.0  99.0  100.0
POPULARITY4                         ...
1=0-25 Percentile    108    3    2  ...     0     0      0
3=50-75 Percentile     0    0    0  ...     0     0      0
2=25-50 Percentile     0    0    0  ...     0     0      0
4=75-100 Percentile    0    0    0  ...     3     4    343

[4 rows x 53 columns]

counts for POPULARITY4
1=0-25 Percentile      615
2=25-50 Percentile     551
3=50-75 Percentile     561
4=75-100 Percentile    454
Name: POPULARITY4, dtype: int64

percentages for POPULARITY4
1=0-25 Percentile      0.281981
2=25-50 Percentile     0.252636
3=50-75 Percentile     0.257221
4=75-100 Percentile    0.208161
Name: POPULARITY4, dtype: float64
------------------------------
SUMMARY
________________________
The distribution of the data for each of these variables is output, as well as the distribution after the variable values have been adjusted to account for missing values.
Distributions of the newly created and recoded variables are also output.
The variable W1_A11 (how many times national news was watched in the week) takes values between -1 and 8, where -1 indicates a missing value; the most popular response, given 463 times, is 8, which indicates the news was watched daily, and there are 11 missing values.
The variable PPNET (internet access) has no missing values; 0 indicates no access and 1 indicates access, and 1,787 of the 2,294 respondents have internet access.
The variable W1_D9 takes values between -1 and 100, plus the value 998; the values -1 and 998 indicate missing values. Of the valid data, 615 responses fall into the bottom quartile group, 551 into the second, 561 into the third, and 454 into the top group.
Missing values were changed to python NaN.
The number of times national news was watched was recoded into a new variable indicating whether the news was watched at least once in the week or not: 0 for no, 1 for yes.
A quartile split of popularity was done (see the toy sketch of qcut below) and the results output, with details of the split in terms of counts and percentages.
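A toy sketch of how pandas.qcut forms the quartile groups (made-up scores, not the survey data):

import pandas

# qcut splits at the data's own 25th/50th/75th percentiles,
# giving four roughly equal-sized groups
scores = pandas.Series([5, 10, 20, 30, 40, 50, 60, 70, 80, 85, 90, 100])
quartiles = pandas.qcut(scores, 4, labels=["Q1", "Q2", "Q3", "Q4"])
print(quartiles.value_counts(sort=False))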
Data Management of variables on popularity, watching national news and internet access
Code
------------------------------
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Mon Jun 25 08:54:30 2018

@author: oonagh.obrien
"""
# Import libraries for doing data manipulation and statistical functions
import pandas
import numpy

# Read in outlook on life dataset 2012 to data variable
# Check how many rows and columns in data
data = pandas.read_csv('ool_pds.csv', low_memory=0)
print('How many rows of data ')
print(len(data))
print('How many variables or columns of data ')
print(len(data.columns))

# Create new view of data frame including PPNET (has internet access),
# W1_A11 (number of times per week national news watched on tv or internet,
# the categorical explanatory variable) and W1_D9.
# Reference this view of data with sub1
sub1 = data[['W1_A11', 'PPNET', 'W1_D9']]
print(sub1.describe())

# Make a copy of my new subsetted data
sub2 = sub1.copy()

# Convert data to numeric
sub2['W1_A11'] = pandas.to_numeric(sub2['W1_A11'])
sub2['PPNET'] = pandas.to_numeric(sub2['PPNET'])
sub2['W1_D9'] = pandas.to_numeric(sub2['W1_D9'])

print('counts for original W1_A11')
c1 = sub2['W1_A11'].value_counts(sort=False, dropna=False)
print(c1)

# Recode missing values (-1 refused) to python missing (NaN)
sub2['W1_A11'] = sub2['W1_A11'].replace(-1, numpy.nan)

# Recode values for W1_A11 into a new variable, WATCH_NAT_NEWS -
# either does or does not watch news
recode1 = {1: 0, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, -1: 0}
sub2['WATCH_NAT_NEWS'] = sub2['W1_A11'].map(recode1)
sub2['HAS_INTERNET'] = sub2['PPNET']

print('counts for original PPNET')
c1 = sub2['PPNET'].value_counts(sort=False, dropna=False)
print(c1)
# No missing values - boolean value true or false provided for all

print('counts for original W1_D9')
c1 = sub2['W1_D9'].value_counts(sort=False, dropna=False)
print(c1)

# Recode missing values (-1 refused, 998 don't recognise) to python missing (NaN)
sub2['W1_D9'] = sub2['W1_D9'].replace(-1, numpy.nan)
sub2['W1_D9'] = sub2['W1_D9'].replace(998, numpy.nan)

# Quartile split (use qcut function & ask for 4 groups - gives you quartile split)
print('Popularity - 4 categories - quartiles')
sub2['POPULARITY4'] = pandas.qcut(sub2.W1_D9, 4,
        labels=["1=0-25 Percentile", "2=25-50 Percentile",
                "3=50-75 Percentile", "4=75-100 Percentile"])
c4 = sub2['POPULARITY4'].value_counts(sort=False, dropna=True)
print(c4)

# Crosstab evaluating which popularity values were put into which POPULARITY4 group
print(pandas.crosstab(sub2['POPULARITY4'], sub2['W1_D9']))

# Frequency distribution for POPULARITY4
print('counts for POPULARITY4')
c10 = sub2['POPULARITY4'].value_counts(sort=False)
print(c10)
print('\n\n')

print('percentages for POPULARITY4')
p10 = sub2['POPULARITY4'].value_counts(sort=False, normalize=True)
print(p10)
print('\n\n')
Output
------------------------------------------------------------------------
How many rows of data
2294
How many variables or columns of data
436
            W1_A11        PPNET        W1_D9
count  2294.000000  2294.000000  2294.000000
mean      4.190061     0.778989    90.412380
std       2.625410     0.415019   154.079581
min      -1.000000     0.000000    -1.000000
25%       2.000000     1.000000    50.000000
50%       4.000000     1.000000    70.000000
75%       7.000000     1.000000    85.000000
max       8.000000     1.000000   998.000000
counts for original W1_A11
 2    264
 4    245
 6    242
 8    463
-1     11
 1    516
 3    284
 5    146
 7    123
Name: W1_A11, dtype: int64
counts for original PPNET
0     507
1    1787
Name: PPNET, dtype: int64
counts for original W1_D9
0      108
2        3
4        2
6        1
10       8
12       1
20       7
22       1
30      78
36       1
38       1
40      94
50     188
58       1
60     194
62       3
68       1
70     314
72       2
74       1
76       1
80      51
88       1
90      63
92       1
94       3
96       1
98       3
100    343
998     62
-1      51
3        2
5        7
7        1
15      94
25       7
29       1
35       5
37       1
45       4
55      11
57       1
59       2
65      21
69       3
75      48
79       2
85     456
87       3
89       2
91       1
93       1
95      27
97       1
99       4
Name: W1_D9, dtype: int64
Popularity - 4 categories - quartiles
1=0-25 Percentile      615
2=25-50 Percentile     551
3=50-75 Percentile     561
4=75-100 Percentile    454
Name: POPULARITY4, dtype: int64
W1_D9                0.0  2.0  3.0  ...  98.0  99.0  100.0
POPULARITY4                         ...
1=0-25 Percentile    108    3    2  ...     0     0      0
3=50-75 Percentile     0    0    0  ...     0     0      0
2=25-50 Percentile     0    0    0  ...     0     0      0
4=75-100 Percentile    0    0    0  ...     3     4    343

[4 rows x 53 columns]

counts for POPULARITY4
1=0-25 Percentile      615
2=25-50 Percentile     551
3=50-75 Percentile     561
4=75-100 Percentile    454
Name: POPULARITY4, dtype: int64

percentages for POPULARITY4
1=0-25 Percentile      0.281981
2=25-50 Percentile     0.252636
3=50-75 Percentile     0.257221
4=75-100 Percentile    0.208161
Name: POPULARITY4, dtype: float64
--------------------------------------
Read in data set - see how many rows of data and columns/variables
Create a copy of the data which holds the three variables of interest
Look at the distribution of the data for each of these variables.
The variable W1_A11 (how many times national news was watched in the week) takes values between -1 and 8, where -1 indicates a missing value; the most popular response, given 463 times, is 8, which indicates the news was watched daily, and there are 11 missing values.
The variable PPNET (internet access) has no missing values; 0 indicates no access and 1 indicates access, and 1,787 of the 2,294 respondents have internet access.
The variable W1_D9 takes values between -1 and 100, plus the value 998; the values -1 and 998 indicate missing values. Of the valid data, 615 responses fall into the bottom quartile group, 551 into the second, 561 into the third, and 454 into the top group.
Missing values were changed to python NaN.
The number of times national news was watched was recoded into a new variable indicating whether the news was watched at least once in the week or not: 0 for no, 1 for yes.
A quartile split of popularity was done and the results output, with details of the split in terms of counts and percentages.
Data Analysis Tools Assignment 2
Explanation :-
Using data from the Outlook on Life surveys 2012, determine whether there is a statistically significant link between watching national news and having internet access: are the two variables independent or dependent?
Is the rate of watching national news equal or not for those who do and those who do not have internet access?
Using data from outlook on life surveys 2012.
PPNet : HH Internet access
W1_A11 : How many days last week did you watch national news on the television or internet
1 none
2 one
…
-1 refused
Ho: Having internet access does not affect whether you watch national news
Ha: Having internet access affects whether you watch national news
1: Changed the quantitative data (number of times per week watched national news) to categorical (did or did not watch national news)
2: Did a chi squared test
3: Graphed results - a post hoc test was not necessary as both variables had only two values
Since the p-value (0.0005) is less than the significance level (0.05), we reject the null hypothesis: there is a statistically significant link between having internet access and watching national news.
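A minimal sketch of that decision rule, using the observed counts from the contingency table in the output below:

import scipy.stats

# Observed counts: rows = watch national news (no/yes),
# columns = has internet (no/yes)
observed = [[146, 381],
            [361, 1406]]
chi2, p, dof, expected = scipy.stats.chi2_contingency(observed)
print(chi2, p)
if p < 0.05:
    print('Reject H0: the variables are associated')
else:
    print('Fail to reject H0')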
------------------------------------------------------
Code :-
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Mon Jun 25 08:54:30 2018

@author: oonagh.obrien
"""
# Import libraries for doing data manipulation and statistical functions
import pandas
import scipy.stats
import seaborn
import matplotlib.pyplot as plt

# Read in outlook on life dataset 2012 to data variable
# Check how many rows and columns in data
data = pandas.read_csv('ool_pds.csv', low_memory=0)
print(len(data))
print(len(data.columns))

# Create new view of data frame with nulls removed, including PPNET
# (has internet access) and W1_A11 (number of times per week national news
# watched on tv or internet, the categorical explanatory variable).
# Reference this view of data with sub1
sub1 = data[['W1_A11', 'PPNET']].dropna()
print(sub1.describe())

# Convert data to numeric
sub1['W1_A11'] = pandas.to_numeric(sub1['W1_A11'])
sub1['PPNET'] = pandas.to_numeric(sub1['PPNET'])

# Recode values for W1_A11 into a new variable, WATCH_NAT_NEWS
recode1 = {1: 0, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, -1: 0}
sub1['WATCH_NAT_NEWS'] = sub1['W1_A11'].map(recode1)
sub1['HAS_INTERNET'] = sub1['PPNET']

# Output the number of people in each category of times per week
grouped = sub1.groupby('WATCH_NAT_NEWS')
print(grouped.count())

# Create contingency table of observed counts
contingency_table = pandas.crosstab(sub1['WATCH_NAT_NEWS'], sub1['HAS_INTERNET'])
print(contingency_table)

# Calculate column percentages: of those who have internet, the share who
# do/don't watch national news, and likewise for those without internet
col_totals = contingency_table.sum(axis=0)
percentages = contingency_table / col_totals
print(percentages)

# Calculate chi-square
print('chi-square value, p value, expected counts')
chi_square = scipy.stats.chi2_contingency(contingency_table)
print(chi_square)

# Set variable types
sub1["HAS_INTERNET"] = sub1["HAS_INTERNET"].astype('category')
sub1['WATCH_NAT_NEWS'] = pandas.to_numeric(sub1['WATCH_NAT_NEWS'], errors='coerce')

# Graph proportion who watch national news within each internet-access group
seaborn.factorplot(x="HAS_INTERNET", y="WATCH_NAT_NEWS", data=sub1, kind="bar", ci=None)
plt.xlabel('Has Internet')
plt.ylabel('Proportion Watch National News')
----------------------------------------
OUTPUT
----------------------------------------
2294
436
            W1_A11        PPNET
count  2294.000000  2294.000000
mean      4.190061     0.778989
std       2.625410     0.415019
min      -1.000000     0.000000
25%       2.000000     1.000000
50%       4.000000     1.000000
75%       7.000000     1.000000
max       8.000000     1.000000
                W1_A11  PPNET  HAS_INTERNET
WATCH_NAT_NEWS
0                  527    527           527
1                 1767   1767          1767
HAS_INTERNET      0     1
WATCH_NAT_NEWS
0               146   381
1               361  1406
HAS_INTERNET           0         1
WATCH_NAT_NEWS
0               0.287968  0.213206
1               0.712032  0.786794
chi-square value, p value, expected counts
(12.056067718516204, 0.0005162407168329077, 1, array([[ 116.47297297,  410.52702703],
       [ 390.52702703, 1376.47297297]]))
Data Mgmt and Visualisation Assignment 2
Program :-
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Thu Jun 21 11:22:34 2018

@author: oonagh.obrien
"""
import pandas
import numpy

data = pandas.read_csv('ool_pds.csv', low_memory=0)

print()
print()
print('The number of observations in the data is ')
print(len(data))
print('The number of variables in the data is ')
print(len(data.columns))

sub1 = data[['PPNET', 'W1_A11', 'W1_D9']]
#print(data.describe())
#print(data['PPNET'].describe())

# Convert the data for the variables W1_D9 (opinion of Hillary Clinton),
# PPNET (access to internet) and W1_A11 (number of times watch news per week)
data['PPNET'] = pandas.to_numeric(data['PPNET'])
data['W1_A11'] = pandas.to_numeric(data['W1_A11'])
data['W1_D9'] = pandas.to_numeric(data['W1_D9'])

# Find percentage of observations that indicate internet access
ct1 = data.groupby('PPNET').size()
pt1 = data.groupby('PPNET').size() * 100 / len(data)

c1 = data['PPNET'].value_counts(sort=0, dropna=False)
print('\n\nNumber of respondents with internet access ')
print('0 indicates no access - 1 indicates access')
print(c1)

# Print blank lines
print()
print()
print('Percentage of respondents that have internet access ')
print('0 indicates no access - 1 indicates access')
print(pt1)

# Number of times you watched national news on internet or TV last week
# 1 indicates 0 times, 2 indicates 1 time, ...... 8 indicates 7 times
# -1 indicates no response
c2 = data['W1_A11'].value_counts(dropna=False)
print('\n\n Number of times respondents watched news on internet or TV last week')
print(' 1 indicates 0 times, 2 indicates 1 time, ...... 8 indicates 7 times')
print(' -1 indicates no response')
print(c2)

# Popularity of Hillary Clinton with respondent indicated by percentage
# -1 indicates no response
print('\n\nPopularity of Hillary Clinton with respondent indicated by percentage')
print('Number of respondents that gave percentage in second column\n')
print(' -1 indicates no response')
c3 = data['W1_D9'].value_counts(dropna=False)
print('rate hillary clinton ')
print(c3)
Output :
The number of observations in the data is
2294
The number of variables in the data is
436

Number of respondents with internet access
0 indicates no access - 1 indicates access
0     507
1    1787
Name: PPNET, dtype: int64

Percentage of respondents that have internet access
0 indicates no access - 1 indicates access
PPNET
0    22.101133
1    77.898867
dtype: float64

Number of times respondents watched news on internet or TV last week
1 indicates 0 times, 2 indicates 1 time, ...... 8 indicates 7 times
-1 indicates no response
 1    516
 8    463
 3    284
 2    264
 4    245
 6    242
 5    146
 7    123
-1     11
Name: W1_A11, dtype: int64

Popularity of Hillary Clinton with respondent indicated by percentage
Number of respondents that gave percentage in second column

-1 indicates no response
rate hillary clinton
85     456
100    343
70     314
60     194
50     188
0      108
15      94
40      94
30      78
90      63
998     62
80      51
-1      51
75      48
95      27
65      21
55      11
10       8
5        7
20       7
25       7
35       5
99       4
45       4
62       3
94       3
2        3
87       3
69       3
98       3
79       2
72       2
4        2
89       2
59       2
3        2
74       1
57       1
29       1
88       1
6        1
93       1
12       1
91       1
22       1
36       1
76       1
38       1
37       1
96       1
58       1
97       1
7        1
68       1
92       1
Name: W1_D9, dtype: int64
The output gives the distribution of 3 variables in the input data.
The first is internet access: the number of respondents who do or do not have internet access. As there are as many values in the distribution as there are observations in total, every respondent has a value for this field.
The second is the number of times national news was watched on tv or internet during the week; the second column gives the number of respondents and the first column the number of times per week, with refused responses (-1) listed separately.
The third distribution gives the popularity of Hillary Clinton as rated by respondents in percentage terms, with the number of respondents that gave each percentage in the second column.
Data Analysis Tools Assignment 1
Model Interpretation for ANOVA:
When examining the Outlook on Life 2012 data set and the association between the number of days a national news program was watched on tv or internet (categorical explanatory variable) and the opinion of Hillary Clinton (quantitative response), an Analysis of Variance (ANOVA) revealed a statistically significant association between the number of times a person watched the national news weekly and their opinion of Hillary Clinton. See results below.
The model interpretation for the post hoc ANOVA results (Tukey HSD) indicated that people who watched national news 4, 5, or 7 times a week reported a significantly more positive opinion of Hillary Clinton than those who watched the national news 0 or 1 times per week; all other comparisons were statistically similar. See code for analysis and results below.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Mon Jun 25 08:54:30 2018

@author: oonagh.obrien
"""
# Import libraries for doing data manipulation and statistical functions
import pandas
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

# Read in outlook on life dataset 2012 to data variable
# Check how many rows and columns in data
data = pandas.read_csv('ool_pds.csv', low_memory=0)
print(len(data))
print(len(data.columns))

# Create new view of data frame with nulls removed, including the response
# variable W1_D9 (opinion of Hillary Clinton) and W1_A11 (number of times
# per week national news watched on tv or internet, the categorical
# explanatory variable). Reference this view of data with sub1
sub1 = data[['W1_A11', 'W1_D9']].dropna()
print(sub1)

# Convert data to numeric
sub1['W1_A11'] = pandas.to_numeric(sub1['W1_A11'])
sub1['W1_D9'] = pandas.to_numeric(sub1['W1_D9'])

# Output the number of people in each category of times per week
grouped = sub1.groupby('W1_A11')
print(grouped.count())

# Output the average opinion of Hillary Clinton of people
# in each category of times per week
meanOpinion = sub1.groupby('W1_A11').mean()
print(meanOpinion)

# Output the standard deviation in opinion of Hillary Clinton of people
# in each category of times per week
stdOpinion = sub1.groupby('W1_A11').std()
print(stdOpinion)

# Use statistical function ols to calculate ANOVA
# with response variable W1_D9 (opinion of Hillary Clinton) and categorical
# explanatory variable W1_A11 (times per week national news watched)
opinion_model = smf.ols(formula='W1_D9 ~ C(W1_A11)', data=sub1).fit()
print(opinion_model.summary())

# Having got a p-value less than 0.05, find what the difference between
# means in the model means.
# As the explanatory variable has more than two groups, need to determine
# which groups differ from the others, so perform a post hoc test; to avoid
# family-wise type 1 error, use Tukey's honestly significant difference test
mc = multi.MultiComparison(sub1['W1_D9'], sub1['W1_A11'])
res1 = mc.tukeyhsd()
print(res1.summary())
Data Mgmt and Visualisation Assignment 1
I have selected the Outlook on Life Dataset 2012 produced by Belinda Robnett and Katherine Tate
My research is to determine whether there is a negative link between having internet access and the popularity of Hillary Clinton and, secondly, whether watching news on tv or the internet is an indicator of the popularity of Hillary Clinton.
1. Hypothesis to test if internet access is a predictor of popularity for Hillary Clinton
2. Hypothesis that watching news on tv or internet is a predictor of popularity for Hillary Clinton
A search of academic articles using the search terms ‘Hillary Clinton Voter News Source’ generated 4180 results.
According to Boxell, Gentzkow, and Shapiro (2017), actual data gathered from Trump voters after the 2016 presidential election did not indicate that ‘internet media and online campaign methods conferred an advantage to Trump compared to other Republican presidential candidates in the internet era’, as is often suggested in discussions about Cambridge Analytica and Russian influence.
They found that ‘Relative to prior years, the Republican share of the vote in 2016 was as high or higher among the groups least active online.’
Other research indicates that Russian intervention and the Republicans’ success in “marrying content with delivery and data” online (Johnson 2017) may have influenced the election; others have emphasized that the Trump campaign’s use of data to target messages online was successful (Confessore and Hakim 2017).
Generally it is suggested that the public’s opinion of Hillary Clinton was influenced negatively on social media. This research will investigate whether having access to the internet in 2012 was an indicator of popularity for Hillary Clinton.
In the research I will use the categorical data on whether a person had internet access or not (PPNET, ‘Internet Access’) and the quantitative data on participants’ opinion of Hillary Clinton (W1_D9, ‘How would you rate Hillary Clinton’) to see if there is a link between internet access and opinion of Hillary Clinton, supporting my first hypothesis that internet access is an indicator of popularity for Hillary Clinton. The unique identifier case-id will be used to distinguish each participant in the survey. The variable W1_A11 (‘How many days did you watch national news programs on television or internet’) may be used in further research for the second hypothesis, to see the link between source of news and opinion of Hillary Clinton (W1_D9). The Pew Research Center found that the main source of news for Trump voters was Fox News.
Boxell, Levi, Matthew Gentzkow, and Jesse M. Shapiro. 2017. A Note on Internet Use and the 2016 Election Outcome. Brown University and NBER, September 2017.
Confessore, Nicholas and Danny Hakim. 2017. Data firm says ‘secret sauce’ aided Trump; Many scoff. New York Times. Available at https://www.nytimes.com/2017/03/06/us/politics/cambridge-analytica.html. Accessed September 14, 2017.
Hampton, Keith N. and Eszter Hargittai. 2016. Stop blaming Facebook for Trump’s election win. The Hill. Available at http://thehill.com/blogs/pundits-blog/presidential-campaign/307438-stop-blaming-facebook-for-trumps-election-win. Accessed June 14, 2017.
Johnson, Eric. 2017. Full transcript: Hillary Clinton at Code 2017. recode.net. Available at https://www.recode.net/2017/5/31/15722218/hillary-clinton-code-conference-transcript-donald-trump-2016-russia-walt-mossberg-kara-swisher. Accessed September 21, 2017.
Pew Research Center. 2017. Trump, Clinton Voters Divided in Their Main Source for Election News. Journalism & Media, Pew Research Center, January 18, 2017.