oonaghmaryobrien-blog
oonaghmaryobrien-blog · 7 years ago
Text
Assignment 4 Data Analysis and Visualisation
Code :-
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Thu Jun 21 11:22:34 2018
@author: oonagh.obrien
"""
#Import libraries for doing data manipulation and statistical functions
import pandas
import numpy
import seaborn
import matplotlib.pyplot as plt
# Read in outlook on life dataset 2012 to data variable
data = pandas.read_csv('ool_pds.csv', low_memory=False)
#Set PANDAS to show all columns in DataFrame
pandas.set_option('display.max_columns', None)
#Set PANDAS to show all rows in DataFrame
pandas.set_option('display.max_rows', None)
# Create new view of data frame including:
# PPNET  - has internet access
# W1_A11 - number of times per week national news watched on TV or internet
# W1_D9  - Hillary Clinton popularity rating
sub1 = data[['W1_A11', 'PPNET', 'W1_D9']]
#print('Description of 3 variables \n')
#print(sub1.describe())
#Make a copy of my new subsetted data
sub2 = sub1.copy()
# Convert data to numeric
sub2['W1_A11'] = pandas.to_numeric(sub2['W1_A11'])
sub2['PPNET'] = pandas.to_numeric(sub2['PPNET'])
sub2['W1_D9'] = pandas.to_numeric(sub2['W1_D9'])
#print ('\nFrequency table for original W1_A11 - how many times watch National News')
c1 = sub2['W1_A11'].value_counts(sort=False, dropna=False)
#print(c1)
# Recode missing values (-1, refused) to Python missing (NaN)
sub2['W1_A11'] = sub2['W1_A11'].replace(-1, numpy.nan)
#print ('\nFrequency table for W1_A11 after recoding missing values to NaN')
c1 = sub2['W1_A11'].value_counts(sort=False, dropna=False)
#print(c1)
#Recode values for W1_A11 (1..8) into a new variable, TIMES_WATCH_NAT_NEWS (0..7)
recode1 = {1: 0, 2: 1, 3: 2, 4: 3, 5: 4, 6: 5, 7: 6, 8: 7, -1: 0}
sub2['TIMES_WATCH_NAT_NEWS'] = sub2['W1_A11'].map(recode1)
#Univariate bar graph for categorical variables
# First change format from numeric to categorical
sub2['TIMES_WATCH_NAT_NEWS'] = sub2['TIMES_WATCH_NAT_NEWS'].astype('category')
seaborn.countplot(x='TIMES_WATCH_NAT_NEWS', data=sub2)
plt.xlabel('Number of times watch national news per week')
plt.ylabel('Number of responses')
plt.title('Univariate Graph of Number of times watch national news per week')
fig = plt.gcf()
fig.savefig('TIMES_WATCH_NAT_NEWS')
plt.show()
#Recode values for W1_A11 into a new variable, WATCH_NAT_NEWS -
#either does (1) or does not (0) watch news
recode1 = {1: 0, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, -1: 0}
sub2['WATCH_NAT_NEWS'] = sub2['W1_A11'].map(recode1)
#print ('\nFrequency table for new recoded variable WATCH_NAT_NEWS')
c1 = sub2['WATCH_NAT_NEWS'].value_counts(sort=False, dropna=False)
#print(c1)
sub2['HAS_INTERNET'] = sub2['PPNET']
#print ('\nFrequency table for new variable HAS_INTERNET - has internet or not')
c1 = sub2['HAS_INTERNET'].value_counts(sort=False, dropna=False)
#print(c1)
print('\nFrequency table for original variable PPNET')
print('- has internet or not')
print('counts for original PPNET')
c1 = sub2['PPNET'].value_counts(sort=False, dropna=False)
print(c1)
# No missing values - boolean value true or false provided for all
#Univariate bar graph for categorical variables
# First change format from numeric to categorical
sub2['PPNET'] = sub2['PPNET'].astype('category')
seaborn.countplot(x='PPNET', data=sub2)
plt.xlabel('Has internet or not - 0 indicates no, 1 indicates yes')
plt.ylabel('Number of responses')
plt.title('Univariate Graph of Internet access or not')
fig = plt.gcf()
fig.savefig('HASINTERNET')
plt.show()
#print ('\nFrequency table for W1_D9 - Hillary Clinton popularity')
c1 = sub2['W1_D9'].value_counts(sort=False, dropna=False)
#print(c1)
# Recode missing values (-1 refused, 998 don't recognise) to NaN
sub2['W1_D9'] = sub2['W1_D9'].replace(-1, numpy.nan)
sub2['W1_D9'] = sub2['W1_D9'].replace(998, numpy.nan)
#print ('\nFrequency table for W1_D9 after missing values -1 and 998 converted to NaN')
c1 = sub2['W1_D9'].value_counts(sort=False, dropna=False)
#print(c1)
#Univariate histogram for quantitative variable:
seaborn.distplot(sub2['W1_D9'].dropna(), kde=False)
plt.xlabel('Popularity of Hillary Clinton')
plt.ylabel('Number of responses')
plt.title('Univariate Graph of Popularity of Hillary Clinton')
fig = plt.gcf()
fig.savefig('HILLARYPOPULARITY')
plt.show()
# Bivariate bar graph C->Q: times watch national news explanatory,
# Hillary Clinton popularity response
seaborn.factorplot(x='TIMES_WATCH_NAT_NEWS', y='W1_D9', data=sub2, kind='bar', ci=None)
plt.title('Bivariate graph of Hillary Popularity as response to Times watch National News from life dataset 2012 study')
plt.xlabel('Times watch national news')
plt.ylabel('Hillary Popularity')
fig = plt.gcf()
fig.savefig('HILLARYTIMES')
plt.show()
# Bivariate bar graph C->Q: whether watch national news explanatory
seaborn.factorplot(x='WATCH_NAT_NEWS', y='W1_D9', data=sub2, kind='bar', ci=None)
plt.title('Bivariate graph of Hillary Popularity as response to whether watch national news from life dataset 2012 study')
plt.xlabel('Watch national news - 0 indicates no, 1 indicates yes')
plt.ylabel('Hillary Popularity')
fig = plt.gcf()
fig.savefig('HILLARYNEWS')
plt.show()
# Bivariate bar graph C->Q: internet access explanatory
seaborn.factorplot(x='HAS_INTERNET', y='W1_D9', data=sub2, kind='bar', ci=None)
plt.title('Bivariate graph of Hillary Popularity as response to Internet Access from life dataset 2012 study')
plt.xlabel('Has Internet - 0 indicates no, 1 indicates yes')
plt.ylabel('Hillary Popularity')
fig = plt.gcf()
fig.savefig('HILLARYINTERNET')
plt.show()
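One behaviour the recodes above rely on: pandas' Series.map with a dict sends any value missing from the dict to NaN, which is why -1 needs an explicit entry whenever it should not become missing. A minimal sketch with made-up values:

```python
import pandas as pd

s = pd.Series([1, 2, 8, -1])
recode = {1: 0, 2: 1, 8: 7}      # deliberately no entry for -1
out = s.map(recode)
print(out.tolist())              # the unmapped -1 has become NaN
```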
The univariate graph of the number of times national news is watched per week:
[Figure: univariate bar chart of number of times national news is watched per week]
This graph is unimodal, with its highest peak at watching national news 0 times per week; the second highest peak is at 7 times per week. The distribution appears skewed to the right, as the lower categories have higher frequencies than the higher categories, with the exception of the final category, 7 times per week, which is the second highest.
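The visual impression of right-skew can be checked numerically with pandas' sample skewness, where a positive value indicates a longer right tail. A toy sketch (the real check would use sub2['TIMES_WATCH_NAT_NEWS'] as a numeric series; these counts are made up):

```python
import pandas as pd

# toy values standing in for the times-per-week column
times = pd.Series([0, 0, 0, 0, 1, 1, 2, 2, 3, 7, 7])
print(times.skew())  # positive => longer right tail, i.e. right-skewed
```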
The univariate graph of the categorical variable internet access.
[Figure: univariate bar chart of internet access or not]
This graph is unimodal, with its highest peak at having internet access: approximately 1,750 of 2,250 respondents have internet access.
The univariate graph of the quantitative variable Hillary Clinton popularity:
[Figure: univariate histogram of Hillary Clinton popularity]
This graph is unimodal, with its highest peak at approximately 85% popularity, with around 460 respondents. The distribution appears skewed to the left, as there are more responses at the higher popularity values.
Bivariate Graph of Number of times watch National News with Hillary Clinton Popularity
[Figure: bivariate bar chart of times watching national news per week v Hillary Clinton popularity]
The graph above plots the number of times national news is watched per week against the level of Hillary Clinton's popularity. The bar chart does not show a strong relationship between the two variables, although popularity does rise slightly as the number of times the news is watched increases.
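The bars in a C->Q factorplot are just the group means of the response within each category, so the trend can also be read off directly with groupby. A toy sketch with hypothetical columns 'times' and 'pop' standing in for the recoded news variable and W1_D9:

```python
import pandas as pd

df = pd.DataFrame({
    'times': [0, 0, 3, 7, 7],
    'pop':   [50, 55, 60, 70, 65],
})
means = df.groupby('times')['pop'].mean()
print(means)  # each bar in the factorplot is one of these per-category means
```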
Assignment 4
Code :-
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Fri Jul 20 07:53:09 2018
@author: oonagh.obrien
"""
# -*- coding: utf-8 -*-
"""
Created on Mon Sep 21 10:18:43 2015
@author: jml
"""
# ANOVA
import numpy
import pandas
import statsmodels.formula.api as smf
import seaborn
import matplotlib.pyplot as plt
# Read in outlook on life dataset 2012 to data variable
# Check how many rows and columns in data
data = pandas.read_csv('ool_pds.csv', low_memory=False)
#Set PANDAS to show all columns in DataFrame
pandas.set_option('display.max_columns', None)
#Set PANDAS to show all rows in DataFrame
pandas.set_option('display.max_rows', None)
# create new view of data frame with nulls removed
# include
# W1_D9 Hillary Clinton Popularity
# W1_C1 - Republican (1), Democrat (2), Independent (3), Something Else (4), Refused (-1)
# W1_B4 - Extremely Angry (1), Very Angry (2), Somewhat Angry (3),
# A little Angry (4), Not Angry at all (5), Refused (-1)
sub1 = data[['W1_C1','W1_B4','W1_D9']]
#make a copy of my new subsetted data
sub2 = sub1.copy()
# Convert data to numeric
sub2['W1_C1']= pandas.to_numeric(sub2['W1_C1'])
sub2['W1_B4']= pandas.to_numeric(sub2['W1_B4'])
sub2['W1_D9']= pandas.to_numeric(sub2['W1_D9'])
#print ('\nFrequency table for original W1_C1 - Political allegiance')
sub2['W1_C1']=sub2['W1_C1'].replace(-1, numpy.nan)
c1 = sub2['W1_C1'].value_counts(sort=False, dropna=False)
#print(c1)
# W1_D9 - Hillary Clinton Popularity
# recode missing values -1 refused or 998 don't recognise to python missing (NaN)
sub2['W1_D9']=sub2['W1_D9'].replace(-1, numpy.nan)
sub2['W1_D9']=sub2['W1_D9'].replace(998, numpy.nan)
#print ('\nFrequency table for  variable W1_D9')
#print('- Hillary Clinton popularity after missing values -1 and 998 converted to Nan')
#print ('\nFrequency table for original W1_D9 - Hillary Clinton Popularity')
c1 = sub2['W1_D9'].value_counts(sort=False, dropna=False)
#print(c1)
#print ('\nFrequency table for original W1_B4 - Level of anger')
#Recode so that an increasing number means increasing anger
#(originally an increasing number meant decreasing anger)
recode1 = {1: 5, 2: 4, 3: 3, 4: 2, 5: 1}
sub2['W1_B4']= sub2['W1_B4'].map(recode1)
sub2['W1_B4']=sub2['W1_B4'].replace(-1, numpy.nan)
c1 = sub2['W1_B4'].value_counts(sort=False, dropna=False)
#print(c1)
#Anova for popularity and political allegiance
model1 = smf.ols(formula='W1_D9 ~ C(W1_C1)', data=sub2).fit()
print (model1.summary())
#Anova for popularity and level of anger
model1 = smf.ols(formula='W1_D9 ~ C(W1_B4)', data=sub2).fit()
#print (model1.summary())
groupdata1 = sub2[['W1_D9','W1_B4']].dropna()
groupdata2 = sub2[['W1_D9','W1_C1']].dropna()
#Output means for level of anger
print ('Means for level of anger')
m1= groupdata1.groupby('W1_B4').mean()
#print (m1)
#Output std dev for level of anger
print ('Std Dev for level of anger')
m1= groupdata1.groupby('W1_B4').std()
#print (m1)
#Output means for political affiliation
print ('Means for political affiliation')
m1= groupdata2.groupby('W1_C1').mean()
print (m1)
#Output std dev for political affiliation
print ('Std Dev for political affiliation')
m1= groupdata2.groupby('W1_C1').std()
print (m1)
#seaborn.factorplot(x="W1_B4", y="W1_D9", data=groupdata1, kind="bar", ci=None)
#plt.xlabel('Level of Anger')
#plt.ylabel('Hillary popularity')
#plt.title('Level of Anger v Hillary C popularity')
seaborn.factorplot(x="W1_C1", y="W1_D9", data=groupdata2, kind="bar", ci=None)
plt.xlabel('Political Affiliation')
plt.ylabel('Hillary popularity')
plt.title('Political Affiliation v Hillary C popularity')
plt.show()
# Consider moderator of level of anger - subgroup data depending on
# level of anger
# Not angry at all
grp1=sub2[(sub2['W1_B4']==1)]
# A little angry
grp2=sub2[(sub2['W1_B4']==2)]
# Somewhat angry
grp3=sub2[(sub2['W1_B4']==3)]
# Very Angry
grp4=sub2[(sub2['W1_B4']==4)]
# Extremely Angry
grp5=sub2[(sub2['W1_B4']==5)]
print ('association between Political affiliation and HC popularity for Not angry')
model1 = smf.ols(formula='W1_D9 ~ C(W1_C1)', data=grp1).fit()
print (model1.summary())
print ('association between Political affiliation and HC popularity for A little angry')
model2 = smf.ols(formula='W1_D9 ~ C(W1_C1)', data=grp2).fit()
print (model2.summary())
print ('association between Political affiliation and HC popularity for Somewhat angry')
model3 = smf.ols(formula='W1_D9 ~ C(W1_C1)', data=grp3).fit()
print (model3.summary())
print ('association between Political affiliation and HC popularity for Very angry')
model4 = smf.ols(formula='W1_D9 ~ C(W1_C1)', data=grp4).fit()
print (model4.summary())
print ('association between Political affiliation and HC popularity for Extremely angry')
model5 = smf.ols(formula='W1_D9 ~ C(W1_C1)', data=grp5).fit()
print (model5.summary())
print ("means for Popularity by Political affiliation  Not angry")
m1= grp1.groupby('W1_C1').mean()
print (m1)
print ("means for Popularity by Political affiliation  A little angry")
m2= grp2.groupby('W1_C1').mean()
print (m2)
print ("means for Popularity by Political affiliation  Somewhat angry")
m3= grp3.groupby('W1_C1').mean()
print (m3)
print ("means for Popularity by Political affiliation  Very angry")
m4= grp4.groupby('W1_C1').mean()
print (m4)
print ("means for Popularity by Political affiliation  Extremely angry")
m5= grp5.groupby('W1_C1').mean()
print (m5)
print()
print()
seaborn.factorplot(x="W1_C1", y="W1_D9", data=grp1, kind="bar", ci=None)
plt.xlabel('Political Affiliation')
plt.ylabel('Hillary popularity')
plt.title('Political Affiliation v Hillary C popularity not angry')
seaborn.factorplot(x="W1_C1", y="W1_D9", data=grp2, kind="bar", ci=None)
plt.xlabel('Political Affiliation')
plt.ylabel('Hillary popularity')
plt.title('Political Affiliation v Hillary C popularity a little angry')
seaborn.factorplot(x="W1_C1", y="W1_D9", data=grp3, kind="bar", ci=None)
plt.xlabel('Political Affiliation')
plt.ylabel('Hillary popularity')
plt.title('Political Affiliation v Hillary C popularity somewhat angry')
seaborn.factorplot(x="W1_C1", y="W1_D9", data=grp4, kind="bar", ci=None)
plt.xlabel('Political Affiliation')
plt.ylabel('Hillary popularity')
plt.title('Political Affiliation v Hillary C popularity very angry')
seaborn.factorplot(x="W1_C1", y="W1_D9", data=grp5, kind="bar", ci=None)
plt.xlabel('Political Affiliation')
plt.ylabel('Hillary popularity')
plt.title('Political Affiliation v Hillary C popularity extremely angry')
---------------------------------
Output :-
[Figure: bar chart of political affiliation v Hillary Clinton popularity, all respondents]
                           OLS Regression Results                            
==============================================================================
Dep. Variable:                  W1_D9   R-squared:                       0.271
Model:                            OLS   Adj. R-squared:                  0.270
Method:                 Least Squares   F-statistic:                     266.3
Date:                Sat, 21 Jul 2018   Prob (F-statistic):          6.24e-147
Time:                        10:33:39   Log-Likelihood:                -9906.3
No. Observations:                2154   AIC:                         1.982e+04
Df Residuals:                    2150   BIC:                         1.984e+04
Df Model:                           3                                        
Covariance Type:            nonrobust                                        
===================================================================================
                     coef    std err          t     P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept          39.2125      1.346    29.140      0.000      36.574      41.851
C(W1_C1)[T.2.0]    39.3153     1.514     25.976      0.000     36.347      42.283
C(W1_C1)[T.3.0]    19.8492     1.701     11.668      0.000     16.513      23.185
C(W1_C1)[T.4.0]    12.3853      2.848      4.349      0.000       6.801      17.970
==============================================================================
Omnibus:                      152.720   Durbin-Watson:                   1.964
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              186.379
Skew:                          -0.677   Prob(JB):                     3.37e-41
Kurtosis:                       3.495   Cond. No.                         7.39
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
association between Political affiliation and HC popularity for Not angry
                           OLS Regression Results                            
==============================================================================
Dep. Variable:                  W1_D9   R-squared:                       0.104
Model:                            OLS   Adj. R-squared:                  0.094
Method:                 Least Squares   F-statistic:                     10.15
Date:                Sat, 21 Jul 2018   Prob (F-statistic):           2.38e-06
Time:                        10:33:39   Log-Likelihood:                -1212.6
No. Observations:                 267   AIC:                             2433.
Df Residuals:                     263   BIC:                             2448.
Df Model:                           3                                        
Covariance Type:            nonrobust                                        
===================================================================================
                     coef    std err          t     P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept          55.3125      5.719     9.672      0.000      44.052      66.573
C(W1_C1)[T.2.0]    24.3249     5.981      4.067      0.000     12.549      36.101
C(W1_C1)[T.3.0]    11.7036     6.415      1.825      0.069     -0.927      24.334
C(W1_C1)[T.4.0]     8.4097     7.860      1.070      0.286     -7.067      23.886
==============================================================================
Omnibus:                       47.995   Durbin-Watson:                   1.996
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               72.160
Skew:                          -1.074   Prob(JB):                     2.14e-16
Kurtosis:                       4.368   Cond. No.                         10.5
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
association between Political affiliation and HC popularity for A little angry
                           OLS Regression Results                            
==============================================================================
Dep. Variable:                  W1_D9   R-squared:                       0.191
Model:                            OLS   Adj. R-squared:                  0.187
Method:                 Least Squares   F-statistic:                     41.70
Date:                Sat, 21 Jul 2018   Prob (F-statistic):           3.36e-24
Time:                        10:33:39   Log-Likelihood:                -2386.0
No. Observations:                 533   AIC:                             4780.
Df Residuals:                     529   BIC:                             4797.
Df Model:                           3                                        
Covariance Type:            nonrobust                                        
===================================================================================
                     coef    std err          t     P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept          44.8333      3.296    13.603      0.000      38.359      51.308
C(W1_C1)[T.2.0]    33.5203     3.482      9.628      0.000     26.681      40.360
C(W1_C1)[T.3.0]    17.3363     3.865      4.486      0.000       9.744      24.928
C(W1_C1)[T.4.0]    21.0490     6.140      3.428      0.001       8.988      33.110
==============================================================================
Omnibus:                       63.753   Durbin-Watson:                   2.073
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               87.388
Skew:                          -0.864   Prob(JB):                     1.06e-19
Kurtosis:                       3.974   Cond. No.                        10.0
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
association between Political affiliation and HC popularity for Somewhat angry
                           OLS Regression Results                            
==============================================================================
Dep. Variable:                  W1_D9   R-squared:                       0.282
Model:                            OLS   Adj. R-squared:                  0.278
Method:                 Least Squares   F-statistic:                     89.63
Date:                Sat, 21 Jul 2018   Prob (F-statistic):           6.07e-49
Time:                        10:33:39   Log-Likelihood:                -3099.6
No. Observations:                 690   AIC:                             6207.
Df Residuals:                     686   BIC:                             6225.
Df Model:                           3                                        
Covariance Type:            nonrobust                                        
===================================================================================
                     coef    std err         t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept          40.5978      2.260    17.965      0.000      36.161      45.035
C(W1_C1)[T.2.0]    37.8421     2.497     15.154      0.000     32.939      42.745
C(W1_C1)[T.3.0]    20.0949     2.817      7.133      0.000     14.564      25.626
C(W1_C1)[T.4.0]    13.1522     5.871      2.240      0.025       1.625      24.680
==============================================================================
Omnibus:                       84.449   Durbin-Watson:                   1.932
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              126.149
Skew:                          -0.843   Prob(JB):                     4.05e-28
Kurtosis:                       4.244   Cond. No.                         9.06
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
association between Political affiliation and HC popularity for Very angry
                           OLS Regression Results                            
==============================================================================
Dep. Variable:                  W1_D9   R-squared:                       0.231
Model:                            OLS   Adj. R-squared:                  0.226
Method:                 Least Squares   F-statistic:                     41.42
Date:                Sat, 21 Jul 2018   Prob (F-statistic):           2.04e-23
Time:                        10:33:39   Log-Likelihood:                -1951.3
No. Observations:                 417   AIC:                             3911.
Df Residuals:                     413   BIC:                             3927.
Df Model:                           3                                        
Covariance Type:            nonrobust                                        
===================================================================================
                     coef    std err          t     P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept          41.8913      2.730    15.343      0.000      36.524      47.258
C(W1_C1)[T.2.0]    35.3237     3.338     10.582      0.000     28.762      41.885
C(W1_C1)[T.3.0]    15.4104     3.656      4.215      0.000       8.224      22.597
C(W1_C1)[T.4.0]     9.7174     6.105      1.592      0.112     -2.284      21.719
==============================================================================
Omnibus:                       28.366   Durbin-Watson:                   1.755
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               32.526
Skew:                          -0.680   Prob(JB):                     8.65e-08
Kurtosis:                       3.147   Cond. No.                         6.03
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
association between Political affiliation and HC popularity for Extremely angry
                           OLS Regression Results                            
==============================================================================
Dep. Variable:                  W1_D9   R-squared:                       0.343
Model:                            OLS   Adj. R-squared:                  0.334
Method:                 Least Squares   F-statistic:                     40.32
Date:                Sat, 21 Jul 2018   Prob (F-statistic):           5.21e-21
Time:                        10:33:39   Log-Likelihood:                -1131.3
No. Observations:                 236   AIC:                             2271.
Df Residuals:                     232   BIC:                             2284.
Df Model:                           3                                        
Covariance Type:            nonrobust                                        
===================================================================================
                     coef    std err          t     P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept          28.0800      3.402     8.253      0.000      21.377      34.783
C(W1_C1)[T.2.0]    51.2836     4.973     10.313      0.000     41.486      61.081
C(W1_C1)[T.3.0]    19.1668     4.780      4.010      0.000       9.749      28.585
C(W1_C1)[T.4.0]    -4.0244     7.733     -0.520      0.603    -19.261      11.212
==============================================================================
Omnibus:                       10.111   Durbin-Watson:                   2.182
Prob(Omnibus):                  0.006   Jarque-Bera (JB):                5.862
Skew:                           0.205   Prob(JB):                       0.0533
Kurtosis:                       2.346   Cond. No.                         4.91
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
means for Popularity by Political affiliation  Not angry
     W1_B4      W1_D9
W1_C1                  
1.0      1.0 55.312500
2.0     1.0  79.637427
3.0     1.0  67.016129
4.0     1.0  63.722222
means for Popularity by Political affiliation  A little angry
     W1_B4      W1_D9
W1_C1                  
1.0     2.0  44.833333
2.0     2.0  78.353591
3.0     2.0  62.169643
4.0     2.0  65.882353
means for Popularity by Political affiliation  Somewhat angry
     W1_B4      W1_D9
W1_C1                  
1.0     3.0  40.597826
2.0     3.0  78.439904
3.0     3.0  60.692771
4.0     3.0  53.750000
means for Popularity by Political affiliation  Very angry
     W1_B4      W1_D9
W1_C1                  
1.0     4.0  41.891304
2.0     4.0  77.215054
3.0     4.0  57.301724
4.0     4.0  51.608696
means for Popularity by Political affiliation  Extremely angry
      W1_B4     W1_D9
W1_C1                  
1.0     5.0  28.080000
2.0     5.0  79.363636
3.0     5.0  47.246753
4.0     5.0  24.055556
[Figures: political affiliation v Hillary Clinton popularity bar charts for each anger subgroup]
Political Affiliation v Hillary Clinton Popularity for respondents who are extremely angry
Analysis :-
The ANOVA examining the association between Hillary Clinton's popularity rating and political affiliation gives a very small p-value and a large F-statistic, indicating a statistically significant relationship between the two variables.
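A significant overall F-test with four affiliation categories does not say which pairs of groups differ; a post-hoc comparison such as Tukey's HSD can. A sketch on simulated data: 'party' and 'pop' are hypothetical stand-ins for W1_C1 and W1_D9, with group means chosen to loosely echo the fitted means in the output above (intercept plus coefficients), not real survey data.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
# four affiliation codes, 30 simulated respondents each
df = pd.DataFrame({
    'party': np.repeat([1, 2, 3, 4], 30),
    'pop': np.concatenate([rng.normal(m, 5, 30) for m in (40, 78, 60, 52)]),
})
# pairwise Tukey HSD comparisons of mean popularity across affiliations
tuk = pairwise_tukeyhsd(endog=df['pop'], groups=df['party'], alpha=0.05)
print(tuk.summary())
```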
Further examination of the means and standard deviations, and graphing of the association, shows that popularity is highest when political affiliation is Democrat and lowest when it is Republican.
The variable W1_B4, 'level of anger', was considered as a possible moderator of the association between political affiliation and popularity of Hillary Clinton. This was investigated by making a subgroup for each level of anger: not angry, a little angry, somewhat angry, very angry, and extremely angry.
An ANOVA for the relationship between political affiliation and popularity of Hillary Clinton was then run within each subgroup. Every subgroup had a very small p-value and a large F-statistic, indicating a statistically significant relationship between the two variables.
From examining the means and standard deviations and graphing the relationship between political affiliation and popularity of Hillary Clinton for each subgroup, it appears that level of anger is a moderator of the association, with popularity decreasing among non-Democrats as the level of anger increases.
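Splitting into five subgroups fits five separate models; the same moderation question can also be asked in one model with an interaction term, since moderation means the party effect changes with anger level. A minimal sketch on simulated data ('party', 'anger', 'pop' are hypothetical stand-ins for W1_C1, W1_B4, W1_D9, with an interaction built in by construction):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 600
party = rng.integers(1, 5, n)      # stand-in for W1_C1
anger = rng.integers(1, 6, n)      # stand-in for W1_B4
# simulate a response whose party effect changes with anger level
pop = (60 + 10 * (party == 2) * anger
          - 5 * (party == 1) * anger
          + rng.normal(0, 5, n))
df = pd.DataFrame({'party': party, 'anger': anger, 'pop': pop})

additive = smf.ols('pop ~ C(party) + C(anger)', data=df).fit()
moderated = smf.ols('pop ~ C(party) * C(anger)', data=df).fit()
# a clearly higher R-squared for the interaction model suggests moderation
print(additive.rsquared, moderated.rsquared)
```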
Generating the correlation coefficient to determine the correlation between the number of times national news is watched per week and Hillary Clinton's popularity
------------------------------
CODE
-------------------------------
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Mon Jun 25 08:54:30 2018
@author: oonagh.obrien
"""
#Import libraries for doing data manipulation and statistical functions
import pandas
import numpy
import seaborn
import scipy.stats
import matplotlib.pyplot as plt
# Read in outlook on life dataset 2012 to data variable
data = pandas.read_csv('ool_pds.csv', low_memory=False)
# Create new view of data frame including the quantitative variables:
# W1_A11 - number of times per week national news watched on TV or internet
# W1_D9  - Hillary Clinton popularity rating
sub1 = data[['W1_A11', 'W1_D9']]
print('Description of 2 variables \n')
print(sub1.describe())
#Make a copy of my new subsetted data
sub2 = sub1.copy()
# Convert data to numeric
sub2['W1_A11'] = pandas.to_numeric(sub2['W1_A11'])
sub2['W1_D9'] = pandas.to_numeric(sub2['W1_D9'])
print('\nFrequency table for original W1_A11 - how many times watch National News')
c1 = sub2['W1_A11'].value_counts(sort=False, dropna=False)
print(c1)
# Recode missing values (-1, refused) to Python missing (NaN)
sub2['W1_A11'] = sub2['W1_A11'].replace(-1, numpy.nan)
print('\nFrequency table for W1_A11 after recoding missing values to NaN')
#Recode values for times watched from 0..7 rather than 1..8
#as easier to interpret on a graph
recode1 = {1: 0, 2: 1, 3: 2, 4: 3, 5: 4, 6: 5, 7: 6, 8: 7}
sub2['W1_A11_RECODE'] = sub2['W1_A11'].map(recode1)
c1 = sub2['W1_A11_RECODE'].value_counts(sort=False, dropna=False)
print(c1)
print('\nFrequency table for W1_D9 - Hillary Clinton popularity')
c1 = sub2['W1_D9'].value_counts(sort=False, dropna=False)
print(c1)
# Recode missing values (-1 refused, 998 don't recognise) to NaN
sub2['W1_D9'] = sub2['W1_D9'].replace(-1, numpy.nan)
sub2['W1_D9'] = sub2['W1_D9'].replace(998, numpy.nan)
print('\nFrequency table for W1_D9 after missing values -1 and 998 converted to NaN')
c1 = sub2['W1_D9'].value_counts(sort=False, dropna=False)
print(c1)
print('\n\n')
# Scatterplot with fitted regression line
scat1 = seaborn.regplot(x='W1_A11_RECODE', y='W1_D9', fit_reg=True, data=sub2)
plt.xlabel('Times Watch National News')
plt.ylabel('Rating for Hillary Clinton popularity')
plt.title('Scatterplot for the Association Between \n Times per week watch National News and Hillary Clinton popularity')
data_clean = sub2.dropna()
print('Association Between Times per week watch National News and Hillary Clinton popularity')
print('Pearson coefficient | p-value')
print(scipy.stats.pearsonr(data_clean['W1_A11_RECODE'], data_clean['W1_D9']))
result = scipy.stats.pearsonr(data_clean['W1_A11_RECODE'], data_clean['W1_D9'])
print('\nR-squared or Coefficient of Determination')
print(result[0] * result[0])
--------------------
OUTPUT TO EXAMINE CORRELATION
--------------------
[Image: scatterplot of times per week watching national news vs. Hillary Clinton popularity rating]
Association Between Times per week watch National News and Hillary Clinton popularity
Pearson coefficient | p-value
(0.12865812149374659, 1.73963015355937e-09)
R-squared or Coefficient of Determination
0.016552912226299656
------------------------
ASSESSMENT OF OUTPUT
--------------------------
The correlation coefficient assesses the degree of linear relationship between two variables. It ranges from +1 to -1. A correlation of +1 means that there is a perfect, positive, linear relationship between the two variables. A correlation of -1 means there is a perfect, negative linear relationship between the two variables.
In my example I calculated the correlation coefficient to assess the degree of linear relationship between the number of times national news is watched each week (explanatory variable) and the popularity of Hillary Clinton (response variable).
The results were a Pearson coefficient of 0.12865812149374659 and a p-value of 1.73963015355937e-09.
The Pearson coefficient was very close to 0, which indicates a weak correlation between the number of times national news is watched each week and the popularity of Hillary Clinton. Although the p-value was very small, so the association is statistically significant, the weak correlation means the relationship has little practical importance.
The R-squared or Coefficient of Determination was 0.016552912226299656, again very small, indicating that only about 1.7% of the variability in Hillary Clinton's popularity can be explained by the number of times national news is watched per week.
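As a minimal sketch of how these statistics are obtained, the snippet below runs pearsonr on made-up paired observations (toy values, not the survey data):

```python
# Minimal sketch of computing a Pearson correlation and R-squared
# with scipy, using made-up paired observations (not the survey data).
import scipy.stats

watch = [0, 1, 2, 3, 4, 5, 6, 7]           # times news watched per week
rating = [50, 55, 52, 60, 58, 65, 63, 70]  # hypothetical 0-100 rating

r, p_value = scipy.stats.pearsonr(watch, rating)
r_squared = r * r  # share of rating variance explained by watch

print(r, p_value, r_squared)
```

On the toy data the correlation is strong; on the survey data above it is weak, which is why the small p-value alone is not enough to call the relationship important.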
oonaghmaryobrien-blog · 7 years ago
Data Management of variables on popularity, watching national news and internet access
-----------------
Code
------------------
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Mon Jun 25 08:54:30 2018

@author: oonagh.obrien
"""
#Import libraries for doing data manipulation and statistical functions
import pandas
import numpy

# Read in outlook on life dataset 2012 to data variable
# Check how many rows and columns in data
data = pandas.read_csv('ool_pds.csv', low_memory=False)
print('How many rows of data ')
print(len(data))
print('How many variables or columns of data ')
print (len(data.columns))

# Create a new view of the data frame with the three variables of interest:
# PPNET - has internet access,
# W1_A11 - number of times per week national news watched on tv or internet
# (categorical explanatory variable), and
# W1_D9 - Hillary Clinton popularity
# Reference this view of data with sub1
sub1 = data[['W1_A11','PPNET','W1_D9']]
print('Description of 3 variables \n')
print (sub1.describe())

# Make a copy of my new subsetted data
sub2 = sub1.copy()

# Convert data to numeric
sub2['W1_A11']= pandas.to_numeric(sub2['W1_A11'])
sub2['PPNET']= pandas.to_numeric(sub2['PPNET'])
sub2['W1_D9']= pandas.to_numeric(sub2['W1_D9'])

print ('\nFrequency table for original W1_A11 - how many times watch National News')
c1 = sub2['W1_A11'].value_counts(sort=False, dropna=False)
print(c1)

# Recode missing values (-1 refused) to python missing (NaN)
sub2['W1_A11']=sub2['W1_A11'].replace(-1, numpy.nan)
print ('\nFrequency table for original W1_A11 - how many times watch National News')
print('after recoding missing values to NaN')
c1 = sub2['W1_A11'].value_counts(sort=False, dropna=False)
print(c1)

# Recode values for W1_A11 into a new variable, WATCH_NAT_NEWS
# - either does or does not watch news
recode1 = {1: 0, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1}
sub2['WATCH_NAT_NEWS']= sub2['W1_A11'].map(recode1)
print ('\nFrequency table for new recoded variable WATCH_NAT_NEWS')
print ('- watched National News at least once in week (1) or not (0)')
c1 = sub2['WATCH_NAT_NEWS'].value_counts(sort=False, dropna=False)
print(c1)

sub2['HAS_INTERNET']=sub2['PPNET']
print ('\nFrequency table for new variable HAS_INTERNET')
print('- has internet or not')
c1 = sub2['HAS_INTERNET'].value_counts(sort=False, dropna=False)
print(c1)

print ('\nFrequency table for original variable PPNET')
print('- has internet or not')
print ('counts for original PPNET')
c1 = sub2['PPNET'].value_counts(sort=False, dropna=False)
print(c1)
# No missing values - boolean value true or false provided for all

print ('\nFrequency table for W1_D9')
print('- Hillary Clinton popularity')
print ('counts for original W1_D9')
c1 = sub2['W1_D9'].value_counts(sort=False, dropna=False)
print(c1)

# Recode missing values (-1 refused, 998 don't recognise) to python missing (NaN)
sub2['W1_D9']=sub2['W1_D9'].replace(-1, numpy.nan)
sub2['W1_D9']=sub2['W1_D9'].replace(998, numpy.nan)

print ('\nFrequency table for variable W1_D9')
print('- Hillary Clinton popularity after missing values -1 and 998 converted to NaN')
c1 = sub2['W1_D9'].value_counts(sort=False, dropna=False)
print(c1)
print('\n\n')

# Quartile split (use qcut function and ask for 4 groups - gives a quartile split)
print ('Popularity - 4 categories - quartiles')
sub2['POPULARITY4']=pandas.qcut(sub2.W1_D9, 4, labels=["1=0-25 Percentile","2=25-50 Percentile","3=50-75 Percentile","4=75-100 Percentile"])
c4 = sub2['POPULARITY4'].value_counts(sort=False, dropna=True)
print(c4)
print('\n\n')

# Crosstab evaluating which popularity values were put into which POPULARITY4 group
print (pandas.crosstab(sub2['POPULARITY4'], sub2['W1_D9']))
print('\n\n')

# Frequency distribution for POPULARITY4
print ('counts for POPULARITY4')
c10 = sub2['POPULARITY4'].value_counts(sort=False)
print(c10)
print ('\n\n')

print ('percentages for POPULARITY4')
p10 = sub2['POPULARITY4'].value_counts(sort=False, normalize=True)
print (p10)
print ('\n\n')
------------------
OUTPUT
------------------
How many rows of data
2294
How many variables or columns of data
436

Description of 3 variables

           W1_A11        PPNET        W1_D9
count  2294.000000  2294.000000  2294.000000
mean      4.190061     0.778989    90.412380
std       2.625410     0.415019   154.079581
min      -1.000000     0.000000    -1.000000
25%       2.000000     1.000000    50.000000
50%       4.000000     1.000000    70.000000
75%       7.000000     1.000000    85.000000
max       8.000000     1.000000   998.000000

Frequency table for original W1_A11 - how many times watch National News
 2    264
 4    245
 6    242
 8    463
-1     11
 1    516
 3    284
 5    146
 7    123
Name: W1_A11, dtype: int64

Frequency table for original W1_A11 - how many times watch National News
after recoding missing values to NaN
5.0    146
4.0    245
1.0    516
8.0    463
2.0    264
3.0    284
7.0    123
6.0    242
NaN     11
Name: W1_A11, dtype: int64

Frequency table for new recoded variable WATCH_NAT_NEWS
- watched National News at least once in week (1) or not (0)
1.0    1767
0.0     516
NaN      11
Name: WATCH_NAT_NEWS, dtype: int64

Frequency table for new variable HAS_INTERNET
- has internet or not
0     507
1    1787
Name: HAS_INTERNET, dtype: int64

Frequency table for original variable PPNET
- has internet or not
counts for original PPNET
0     507
1    1787
Name: PPNET, dtype: int64
Frequency table for  W1_D9 - Hillary Clinton popularity counts for original W1_D9 0      108 2        3 4        2 6        1 10       8 12       1 20       7 22       1 30      78 36       1 38       1 40      94 50     188 58       1 60     194 62       3 68       1 70     314 72       2 74       1 76       1 80      51 88       1 90      63 92       1 94       3 96       1 98       3 100    343 998     62 -1       51 3        2 5        7 7        1 15      94 25       7 29       1 35       5 37       1 45       4 55      11 57       1 59       2 65      21 69       3 75      48 79       2 85     456 87       3 89       2 91       1 93       1 95      27 97       1 99       4 Name: W1_D9, dtype: int64
Frequency table for  variable W1_D9 - Hillary Clinton popularity after missing values -1 and 998 converted to Nan 0.0      108 80.0      51 100.0    343 60.0     194 50.0     188 40.0      94 30.0      78 NaN       113 15.0      94 5.0        7 59.0       2 2.0        3 72.0       2 10.0       8 92.0       1 76.0       1 25.0       7 4.0        2 3.0        2 20.0       7 36.0       1 7.0        1 62.0       3 6.0        1 38.0       1 88.0       1 58.0       1 96.0       1 68.0       1 29.0       1 12.0       1 22.0       1 65.0      21 95.0      27 89.0       2 99.0       4 97.0       1 70.0     314 85.0     456 90.0      63 55.0      11 87.0       3 45.0       4 37.0       1 57.0       1 94.0       3 74.0       1 35.0       5 98.0       3 75.0      48 69.0       3 79.0       2 93.0       1 91.0       1 Name: W1_D9, dtype: int64
Popularity - 4 categories - quartiles
1=0-25 Percentile      615
2=25-50 Percentile     551
3=50-75 Percentile     561
4=75-100 Percentile    454
Name: POPULARITY4, dtype: int64
W1_D9                0.0  2.0  3.0  ...  98.0  99.0  100.0
POPULARITY4                         ...
1=0-25 Percentile    108    3    2  ...     0     0      0
3=50-75 Percentile     0    0    0  ...     0     0      0
2=25-50 Percentile     0    0    0  ...     0     0      0
4=75-100 Percentile    0    0    0  ...     3     4    343

[4 rows x 53 columns]
counts for POPULARITY4
1=0-25 Percentile      615
2=25-50 Percentile     551
3=50-75 Percentile     561
4=75-100 Percentile    454
Name: POPULARITY4, dtype: int64
percentages for POPULARITY4
1=0-25 Percentile      0.281981
2=25-50 Percentile     0.252636
3=50-75 Percentile     0.257221
4=75-100 Percentile    0.208161
Name: POPULARITY4, dtype: float64
------------------------------
SUMMARY
________________________
The distribution of the data for each of these variables is output, as well as the distribution after variable values have been adjusted to account for missing values.
The distributions of the newly created and recoded variables are also output.
The variable W1_A11 - how many times national news was watched in the week - takes values between -1 and 8, where -1 indicates a missing value. The most common response is 1 (news not watched at all, 516 respondents), followed by 8 (news watched daily, 463 respondents); there are 11 missing values.
The variable PPNET - internet access - has no missing values; 0 indicates no access and 1 indicates access. 1787 of the 2294 respondents have internet access.
The variable W1_D9 has values between -1 and 100; it also takes the value 998. The values -1 and 998 indicate missing values. Of the data available, 615 values fall in the 0-25th percentile band, 551 in the 25-50th, 561 in the 50-75th and 454 in the 75-100th.
Missing values were changed to python NaN.
The number of times national news was watched was recoded into a new variable indicating whether news was watched at least once in the week - boolean, 0 = no, 1 = yes.
A quartile split of popularity was made and the results output, with details of the quartile split in terms of counts and percentages.
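The recode-and-split steps described above can be sketched on a few made-up values (toy data, not the survey responses):

```python
# Minimal sketch: recode a counts variable to a yes/no flag with map,
# then quartile-split a 0-100 rating with pandas.qcut (toy data).
import pandas
import numpy

times_watched = pandas.Series([1, 3, 8, -1, 5])       # 1 = zero times, -1 = refused
times_watched = times_watched.replace(-1, numpy.nan)  # missing -> NaN
watched = times_watched.map({1: 0, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1})

ratings = pandas.Series([0, 15, 30, 40, 50, 60, 70, 85, 90, 100, None])
quartiles = pandas.qcut(
    ratings, 4,
    labels=["1=0-25 Percentile", "2=25-50 Percentile",
            "3=50-75 Percentile", "4=75-100 Percentile"])

print(watched.value_counts(dropna=False))
print(quartiles.value_counts(sort=False))  # NaN rating stays NaN, excluded here
```

qcut picks the bin edges from the data itself, so each quartile label covers roughly a quarter of the non-missing values, which is what the POPULARITY4 counts above show.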
oonaghmaryobrien-blog · 7 years ago
Data Management of variables on popularity, watching national news and internet access
Code 
------------------------------
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Mon Jun 25 08:54:30 2018

@author: oonagh.obrien
"""
#Import libraries for doing data manipulation and statistical functions
import pandas
import numpy

# Read in outlook on life dataset 2012 to data variable
# Check how many rows and columns in data
data = pandas.read_csv('ool_pds.csv', low_memory=False)
print('How many rows of data ')
print(len(data))
print('How many variables or columns of data ')
print (len(data.columns))

# Create a new view of the data frame with the three variables of interest:
# PPNET - has internet access,
# W1_A11 - number of times per week national news watched on tv or internet
# (categorical explanatory variable), and
# W1_D9 - Hillary Clinton popularity
# Reference this view of data with sub1
sub1 = data[['W1_A11','PPNET','W1_D9']]
print (sub1.describe())

# Make a copy of my new subsetted data
sub2 = sub1.copy()

# Convert data to numeric
sub2['W1_A11']= pandas.to_numeric(sub2['W1_A11'])
sub2['PPNET']= pandas.to_numeric(sub2['PPNET'])
sub2['W1_D9']= pandas.to_numeric(sub2['W1_D9'])

print ('counts for original W1_A11')
c1 = sub2['W1_A11'].value_counts(sort=False, dropna=False)
print(c1)

# Recode missing values (-1 refused) to python missing (NaN)
sub2['W1_A11']=sub2['W1_A11'].replace(-1, numpy.nan)

# Recode values for W1_A11 into a new variable, WATCH_NAT_NEWS
# - either does or does not watch news
recode1 = {1: 0, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1}
sub2['WATCH_NAT_NEWS']= sub2['W1_A11'].map(recode1)
sub2['HAS_INTERNET']=sub2['PPNET']

print ('counts for original PPNET')
c1 = sub2['PPNET'].value_counts(sort=False, dropna=False)
print(c1)
# No missing values - boolean value true or false provided for all

print ('counts for original W1_D9')
c1 = sub2['W1_D9'].value_counts(sort=False, dropna=False)
print(c1)

# Recode missing values (-1 refused, 998 don't recognise) to python missing (NaN)
sub2['W1_D9']=sub2['W1_D9'].replace(-1, numpy.nan)
sub2['W1_D9']=sub2['W1_D9'].replace(998, numpy.nan)

# Quartile split (use qcut function and ask for 4 groups - gives a quartile split)
print ('Popularity - 4 categories - quartiles')
sub2['POPULARITY4']=pandas.qcut(sub2.W1_D9, 4, labels=["1=0-25 Percentile","2=25-50 Percentile","3=50-75 Percentile","4=75-100 Percentile"])
c4 = sub2['POPULARITY4'].value_counts(sort=False, dropna=True)
print(c4)

# Crosstab evaluating which popularity values were put into which POPULARITY4 group
print (pandas.crosstab(sub2['POPULARITY4'], sub2['W1_D9']))

# Frequency distribution for POPULARITY4
print ('counts for POPULARITY4')
c10 = sub2['POPULARITY4'].value_counts(sort=False)
print(c10)
print ('\n\n')

print ('percentages for POPULARITY4')
p10 = sub2['POPULARITY4'].value_counts(sort=False, normalize=True)
print (p10)
print ('\n\n')
Output 
------------------------------------------------------------------------
How many rows of data 2294 How many variables or columns of data 436            W1_A11        PPNET        W1_D9 count  2294.000000  2294.000000  2294.000000 mean      4.190061     0.778989    90.412380 std       2.625410     0.415019   154.079581 min      -1.000000     0.000000    -1.000000 25%       2.000000     1.000000    50.000000 50%       4.000000     1.000000    70.000000 75%       7.000000     1.000000    85.000000 max       8.000000     1.000000   998.000000 counts for original W1_A11 2    264 4    245 6    242 8    463 -1     11 1    516 3    284 5    146 7    123 Name: W1_A11, dtype: int64 counts for original PPNET 0     507 1    1787 Name: PPNET, dtype: int64 counts for original W1_D9 0      108 2        3 4        2 6        1 10       8 12       1 20       7 22       1 30      78 36       1 38       1 40      94 50     188 58       1 60     194 62       3 68       1 70     314 72       2 74       1 76       1 80      51 88       1 90      63 92       1 94       3 96       1 98       3 100    343 998     62 -1       51 3        2 5        7 7        1 15      94 25       7 29       1 35       5 37       1 45       4 55      11 57       1 59       2 65      21 69       3 75      48 79       2 85     456 87       3 89       2 91       1 93       1 95      27 97       1 99       4 Name: W1_D9, dtype: int64 Popularity - 4 categories - quartiles 1=0-25 Percentile      615 2=25-50 Percentile     551 3=50-75 Percentile     561 4=75-100 Percentile    454 Name: POPULARITY4, dtype: int64 W1_D9                0.0    2.0    3.0    ...    98.0   99.0   100.0 POPULARITY4                               ...                       1=0-25 Percentile      108      3      2  ...        0      0      0 3=50-75 Percentile       0      0      0  ...        0      0      0 2=25-50 Percentile       0      0      0  ...        0      0      0 4=75-100 Percentile      0      0      0  ...        3      4    343
[4 rows x 53 columns]

counts for POPULARITY4
1=0-25 Percentile      615
2=25-50 Percentile     551
3=50-75 Percentile     561
4=75-100 Percentile    454
Name: POPULARITY4, dtype: int64
percentages for POPULARITY4
1=0-25 Percentile      0.281981
2=25-50 Percentile     0.252636
3=50-75 Percentile     0.257221
4=75-100 Percentile    0.208161
Name: POPULARITY4, dtype: float64
--------------------------------------
Read in data set - see how many rows of data and columns/variables
Create a copy of the data which holds the three variables of interest
Look at the distribution of the data for each of these variables.
The variable W1_A11 - how many times national news was watched in the week - takes values between -1 and 8, where -1 indicates a missing value. The most common response is 1 (news not watched at all, 516 respondents), followed by 8 (news watched daily, 463 respondents); there are 11 missing values.
The variable PPNET - internet access - has no missing values; 0 indicates no access and 1 indicates access. 1787 of the 2294 respondents have internet access.
The variable W1_D9 has values between -1 and 100; it also takes the value 998. The values -1 and 998 indicate missing values. Of the data available, 615 values fall in the 0-25th percentile band, 551 in the 25-50th, 561 in the 50-75th and 454 in the 75-100th.
Missing values were changed to python NaN.
The number of times national news was watched was recoded into a new variable indicating whether news was watched at least once in the week - boolean, 0 = no, 1 = yes.
A quartile split of popularity was made and the results output, with details of the quartile split in terms of counts and percentages.
oonaghmaryobrien-blog · 7 years ago
Data Analysis Tools Assignment 2
Explanation :-
Using data from the Outlook on Life 2012 survey, determine whether there is a statistically significant link between watching national news and having internet access: are the two variables independent or dependent?
Is the rate of watching national news the same for those who do and do not have internet access?
Variables used from the Outlook on Life 2012 survey:
PPNET : household internet access (0 = no, 1 = yes)
W1_A11 : How many days last week did you watch national news on the television or internet
(1 = none, 2 = one day, ... 8 = seven days; -1 = refused)
Ho: Having internet access does not affect whether you watch national news
Ha: Having internet access affects whether you watch national news
1: Changed the quantitative variable (number of times per week watched national news) to a categorical variable (did or did not watch national news)
2: Ran a chi-squared test of independence
3: Graphed the results - a post hoc test was not necessary as both variables have only two values
Since the p-value (0.0005) is less than the significance level (0.05), we reject the null hypothesis: there is a statistically significant link between having internet access and watching national news.
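A minimal sketch of the chi-square test of independence, run directly on the 2x2 table of observed counts reported in the output (scipy applies the Yates continuity correction to a 2x2 table by default):

```python
# Minimal sketch: chi-square test of independence on the 2x2 table of
# observed counts (rows: watch national news no/yes; cols: internet no/yes).
import scipy.stats

observed = [[146, 381],    # did not watch national news
            [361, 1406]]   # watched national news at least once

chi2, p, dof, expected = scipy.stats.chi2_contingency(observed)
print(chi2, p, dof)  # matches the chi-square and p-value reported below
```

The expected-count array returned here is what the observed counts are compared against: the larger the gap between observed and expected, the larger the chi-square statistic.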
------------------------------------------------------
Code :-
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Mon Jun 25 08:54:30 2018

@author: oonagh.obrien
"""
# Import libraries for doing data manipulation and statistical functions
import pandas
import scipy.stats as scipy
import seaborn
import matplotlib.pyplot as plt

# Read in outlook on life dataset 2012 to data variable
# Check how many rows and columns in data
data = pandas.read_csv('ool_pds.csv', low_memory=False)
print(len(data))
print (len(data.columns))

# Create a new view of the data frame with nulls removed, including
# PPNET - has internet access - and
# W1_A11 - number of times per week national news watched on tv or internet
# (categorical explanatory variable)
# Reference this view of data with sub1
sub1 = data[['W1_A11','PPNET']].dropna()
print (sub1.describe())

# Convert data to numeric
sub1['W1_A11']= pandas.to_numeric(sub1['W1_A11'])
sub1['PPNET']= pandas.to_numeric(sub1['PPNET'])

# Recode values for W1_A11 into a new variable, WATCH_NAT_NEWS
recode1 = {1: 0, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, -1: 0}
sub1['WATCH_NAT_NEWS']= sub1['W1_A11'].map(recode1)
sub1['HAS_INTERNET']=sub1['PPNET']

# Output the number of people in each category
grouped = sub1.groupby('WATCH_NAT_NEWS')
print (grouped.count())

# Create contingency table of observed counts
contingency_table =pandas.crosstab(sub1['WATCH_NAT_NEWS'], sub1['HAS_INTERNET'])
print (contingency_table)

# Calculate column percentages:
# percentage of those who have internet that do/don't watch national news and
# percentage of those who do not have internet that do/don't watch national news
col_totals =contingency_table.sum(axis=0)
percentages =contingency_table/col_totals
print(percentages)

# Calculate chi-square
print ('chi-square value, p value, expected counts')
chi_square = scipy.chi2_contingency(contingency_table)
print (chi_square)

# Set variable types
sub1["HAS_INTERNET"] = sub1["HAS_INTERNET"].astype('category')
sub1['WATCH_NAT_NEWS'] = pandas.to_numeric(sub1['WATCH_NAT_NEWS'], errors='coerce')

# Graph the proportion watching national news within each internet access group
seaborn.factorplot(x="HAS_INTERNET", y="WATCH_NAT_NEWS", data=sub1, kind="bar", ci=None)
plt.xlabel('Has Internet')
plt.ylabel('Proportion Watch National News')
----------------------------------------
OUTPUT
----------------------------------------
2294
436

            W1_A11        PPNET
count  2294.000000  2294.000000
mean      4.190061     0.778989
std       2.625410     0.415019
min      -1.000000     0.000000
25%       2.000000     1.000000
50%       4.000000     1.000000
75%       7.000000     1.000000
max       8.000000     1.000000

                W1_A11  PPNET  HAS_INTERNET
WATCH_NAT_NEWS
0                  527    527           527
1                 1767   1767          1767

HAS_INTERNET      0     1
WATCH_NAT_NEWS
0               146   381
1               361  1406

HAS_INTERNET           0         1
WATCH_NAT_NEWS
0               0.287968  0.213206
1               0.712032  0.786794

chi-square value, p value, expected counts
(12.056067718516204, 0.0005162407168329077, 1,
 array([[ 116.47297297,  410.52702703],
        [ 390.52702703, 1376.47297297]]))

Text(6.8,0.5,'Proportion Watch National News')
[Image: bar chart of proportion watching national news by internet access]
oonaghmaryobrien-blog · 7 years ago
Data Mgmt and Visualisation Assignment 2
Program :-
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Thu Jun 21 11:22:34 2018

@author: oonagh.obrien
"""
import pandas
import numpy

data = pandas.read_csv('ool_pds.csv', low_memory=False)

print()
print()
print('The number of observations in the data is ')
print(len(data))
print('The number of variables in the data is ')
print (len(data.columns))

sub1 = data[['PPNET','W1_A11','W1_D9']]

#print(data.describe())
#print(data['PPNET'].describe())

# Convert the data for the variables W1_D9 - opinion of Hillary Clinton,
# PPNET - access to internet and
# W1_A11 - number of times watch news per week
data['PPNET']= pandas.to_numeric(data['PPNET'])
data['W1_A11']= pandas.to_numeric(data['W1_A11'])
data['W1_D9']= pandas.to_numeric(data['W1_D9'])

# Find the percentage of observations that indicate internet access
ct1 = data.groupby('PPNET').size()
pt1 = data.groupby('PPNET').size()*100/len(data)

c1 = data['PPNET'].value_counts(sort=False, dropna=False)
print('\n\nNumber of respondents with internet access ')
print ('0 indicates no access - 1 indicates access')
print(c1)

# Print blank lines
print()
print()
print('Percentage of respondents that have internet access ')
print ('0 indicates no access - 1 indicates access')
print(pt1)

# Number of times respondent watched national news on internet or TV last week
# 1 indicates 0 times, 2 indicates 1 time, ... 8 indicates 7 times
# -1 indicates no response
c2 = data['W1_A11'].value_counts(dropna=False)
print('\n\n Number of times respondents watched news on internet or TV last week')
print(' 1 indicates 0 times, 2 indicates 1 time, ... 8 indicates 7 times')
print(' -1 indicates no response')
print(c2)

# Popularity of Hillary Clinton with respondent indicated by percentage
# -1 indicates no response
print ('\n\nPopularity of Hillary Clinton with respondent indicated by percentage')
print ('Number of respondents that gave each percentage in second column\n')
print (' -1 indicates no response')
c3 = data['W1_D9'].value_counts(dropna=False)
print('rate Hillary Clinton ')
print(c3)
Output :
The number of observations in the data is
2294
The number of variables in the data is
436
Number of respondents with internet access
0 indicates no access - 1 indicates access
0     507
1    1787
Name: PPNET, dtype: int64
Percentage of respondents that have internet access
0 indicates no access - 1 indicates access
PPNET
0    22.101133
1    77.898867
dtype: float64
Number of times respondents watched news on internet or TV last week
1 indicates 0 times, 2 indicates 1 time, ... 8 indicates 7 times
-1 indicates no response
 1    516
 8    463
 3    284
 2    264
 4    245
 6    242
 5    146
 7    123
-1     11
Name: W1_A11, dtype: int64
Popularity of Hillary Clinton with respondent indicated by percentage
Number of respondents that gave each percentage in second column
-1 indicates no response rate hillary clinton 85     456 100    343 70     314 60     194 50     188 0      108 15      94 40      94 30      78 90      63 998     62 80      51 -1       51 75      48 95      27 65      21 55      11 10       8 5        7 20       7 25       7 35       5 99       4 45       4 62       3 94       3 2        3 87       3 69       3 98       3 79       2 72       2 4        2 89       2 59       2 3        2 74       1 57       1 29       1 88       1 6        1 93       1 12       1 91       1 22       1 36       1 76       1 38       1 37       1 96       1 58       1 97       1 7        1 68       1 92       1 Name: W1_D9, dtype: int64
The output gives the distribution of three variables in the input data.
The first is internet access: the number of respondents who do and do not have internet access. As there are as many values in the distribution as there are observations, all respondents have a value for this field.
The second is the number of times national news was watched on TV or the internet during the week; the first column gives the number of times per week and the second column the number of respondents. Missing responses are coded -1.
The third distribution gives the popularity of Hillary Clinton: the first column is the percentage rating given and the second column is the number of respondents who gave that rating.
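The counts and percentages in these distributions come from value_counts; a minimal sketch on made-up responses (toy values, not the survey data):

```python
# Minimal sketch: counts and percentages of a 0/1 variable with
# value_counts (toy responses, not the survey data).
import pandas

ppnet = pandas.Series([1, 1, 0, 1, 0, 1, 1, 1])  # 1 = has internet access

counts = ppnet.value_counts(sort=False, dropna=False)
percentages = ppnet.value_counts(normalize=True) * 100

print(counts)
print(percentages)
```

normalize=True returns proportions, so multiplying by 100 gives the percentage breakdown shown in the output.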
oonaghmaryobrien-blog · 7 years ago
Data Analysis Tools Assignment 1
Model Interpretation for ANOVA:
When examining the Outlook on Life 2012 data set for an association between the number of days a national news program was watched on TV or the internet (categorical explanatory variable) and opinion of Hillary Clinton (quantitative response variable), an Analysis of Variance (ANOVA) revealed a statistically significant association between the number of times per week a person watched the national news and their opinion of Hillary Clinton. See results below.
The model interpretation for the post hoc ANOVA results (Tukey HSD) indicated that people who watched national news 4, 5 or 7 times a week reported a significantly more positive opinion of Hillary Clinton than those who watched the national news 0 or 1 times per week; all other comparisons were statistically similar. See code and results below.
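A minimal sketch of this ANOVA-plus-Tukey workflow, run on small synthetic groups rather than the survey variables:

```python
# Minimal sketch: one-way ANOVA via statsmodels ols, then a Tukey HSD
# post hoc test, using synthetic groups instead of W1_A11 / W1_D9.
import pandas
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

df = pandas.DataFrame({
    'group':  [0]*5 + [1]*5 + [2]*5,
    'rating': [50, 52, 49, 51, 50,      # group 0
               55, 57, 54, 56, 55,      # group 1
               70, 72, 69, 71, 70],     # group 2
})

# ANOVA: do the group means differ?
model = smf.ols('rating ~ C(group)', data=df).fit()
print(model.f_pvalue)

# Tukey HSD: which pairs of groups differ?
mc = multi.MultiComparison(df['rating'], df['group'])
print(mc.tukeyhsd().summary())
```

The ANOVA only says that at least one group mean differs; the Tukey test is what identifies the specific pairs, which is why it is needed here with more than two groups.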
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Mon Jun 25 08:54:30 2018

@author: oonagh.obrien
"""
#Import libraries for doing data manipulation and statistical functions
import pandas
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
# Read in outlook on life dataset 2012 to data variable
# Check how many rows and columns in data
data = pandas.read_csv('ool_pds.csv', low_memory=False)
print(len(data))
print (len(data.columns))

# Create a new view of the data frame with nulls removed, including
# the response variable W1_D9 - opinion of Hillary Clinton - and
# W1_A11 - number of times per week national news watched on tv or internet
# (categorical explanatory variable)
# Reference this view of data with sub1
sub1 = data[['W1_A11','W1_D9']].dropna()
print (sub1)

# Convert data to numeric
sub1['W1_A11']= pandas.to_numeric(sub1['W1_A11'])
sub1['W1_D9']= pandas.to_numeric(sub1['W1_D9'])

# Output the number of people in each category of times per week
grouped = sub1.groupby('W1_A11')
print (grouped.count())

# Output the average opinion of Hillary Clinton of people
# in each category of times per week
meanOpinion = sub1.groupby('W1_A11').mean()
print(meanOpinion)

# Output the standard deviation in opinion of Hillary Clinton of people
# in each category of times per week
stdOpinion = sub1.groupby('W1_A11').std()
print(stdOpinion)

# Use the statistical function ols to calculate the ANOVA, with
# response variable W1_D9 - opinion of Hillary Clinton - and
# W1_A11 - number of times per week national news watched on tv or internet
# (categorical explanatory variable)
opinion_model = smf.ols(formula ='W1_D9 ~ C(W1_A11)', data=sub1).fit()
print (opinion_model.summary())

# Having got a p-value less than 0.05, find what the difference
# between means in the model means ...
# As the explanatory variable has more than two groups, determine
# which groups differ from the others:
# perform a post hoc test, and to avoid family-wise type 1 error
# use Tukey's honestly significant difference test
mc = multi.MultiComparison(sub1['W1_D9'], sub1['W1_A11'])
res1 = mc.tukeyhsd()
print(res1.summary())
[Images: OLS/ANOVA model summary, group means and standard deviations, and Tukey HSD post hoc results]
oonaghmaryobrien-blog · 7 years ago
Data Mgmt and Visualisation Assignment 1
I have selected the Outlook on Life Dataset 2012, produced by Belinda Robnett and Katherine Tate.
My research aims to determine, first, whether there is a negative link between having internet access and the popularity of Hillary Clinton, and second, whether watching news on TV or the internet is an indicator of the popularity of Hillary Clinton.
1. Hypothesis: internet access is a predictor of popularity for Hillary Clinton
2. Hypothesis: watching news on TV or the internet is a predictor of popularity for Hillary Clinton
A search of academic articles using the search terms ‘Hillary Clinton Voter News Source’ generated 4180 results. 
According to Boxell and Gentzkow (2017), actual data gathered from Trump voters after the 2016 presidential election did not indicate that 'internet media and online campaign methods conferred an advantage to Trump compared to other Republican presidential candidates in the internet era', as is often suggested in discussion of Cambridge Analytica and Russian influence.
They found that ‘Relative to prior years, the Republican share of the vote in 2016 was as high or higher among the groups least active online.’ 
Other research indicates that Russian intervention and Republicans' success in "marrying content with delivery and data" online (Johnson 2017) may have influenced the election. Others have emphasized that the Trump campaign's use of data to target messages online was successful (Confessore and Hakim 2017).
Generally it is suggested that the public's opinion of Hillary Clinton was influenced negatively on social media. This research will investigate whether having access to the internet in 2012 was an indicator of popularity for Hillary Clinton.
In the research I will use the categorical variable on whether a person had internet access or not (PPNET - Internet Access) and the quantitative variable on participants' opinion of Hillary Clinton (W1_D9, 'How would you rate Hillary Clinton') to see if there is a link between internet access and opinion of Hillary Clinton, supporting my first hypothesis, 'Internet access is an indicator of popularity for Hillary Clinton'. The unique identifier case-id will be used to distinguish each participant in the survey. The variable W1_A11, 'how many days did you watch national news programs on television or internet', may be used in further research on the second hypothesis, to look for a link between news source and opinion of Hillary Clinton (W1_D9). The Pew Research Center found that the main source of news for Trump voters was Fox News.
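A first look at the hypothesis could compare mean W1_D9 ratings between the two PPNET groups. The sketch below uses a tiny invented stand-in dataset (not the real ool_pds.csv) purely to show the shape of that comparison: group means by internet access, then a two-sample t-test as a simple first check.

```python
import pandas as pd
import scipy.stats as stats

# Hypothetical stand-in for ool_pds.csv: PPNET (1 = has internet access)
# and W1_D9 (0-100 rating of Hillary Clinton); values are invented
data = pd.DataFrame({
    'PPNET': [1, 1, 1, 0, 0, 1, 0, 1, 0, 0],
    'W1_D9': [60, 55, 40, 80, 75, 50, 85, 45, 70, 90],
})

# Mean rating per group: is it lower among those with internet access?
print(data.groupby('PPNET')['W1_D9'].mean())  # 0 -> 80.0, 1 -> 50.0

# Two-sample t-test (equivalent to a one-way ANOVA with two groups)
with_net = data.loc[data['PPNET'] == 1, 'W1_D9']
without_net = data.loc[data['PPNET'] == 0, 'W1_D9']
t, p = stats.ttest_ind(with_net, without_net)
print(f't = {t:.2f}, p = {p:.4f}')
```

On the real survey data the same pattern applies: group W1_D9 by PPNET, inspect the means, then test whether the difference is statistically significant.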
Confessore, Nicholas and Danny Hakim. 2017. Data firm says ‘secret sauce’ aided Trump; Many scoff. New York Times. Available at https://www.nytimes.com/2017/03/06/us/politics/cambridge-analytica.html. Accessed September 14, 2017. 
Hampton, Keith N. and Eszter Hargittai. 2016. Stop blaming Facebook for Trump’s election win. The Hill. Available at http://thehill.com/blogs/pundits-blog/presidential-campaign/307438-stop-blaming-facebook-for-trumps-election-win. Accessed June 14, 2017.
Johnson, Eric. 2017. Full transcript: Hillary Clinton at Code 2017. recode.net. Available at https://www.recode.net/2017/5/31/15722218/hillary-clinton-code-conference-transcript-donald-trump-2016-russia-walt-mossberg-kara-swisher. Accessed September 21, 2017.
Boxell, Levi, Matthew Gentzkow, and Jesse M. Shapiro. 2017. A Note on Internet Use and the 2016 Election Outcome. Brown University and NBER, September 2017.
Pew Research Center. 2017. Trump, Clinton Voters Divided in Their Main Source for Election News. Journalism & Media, Pew Research Center, January 18, 2017.