oonaghmaryobrien-blog
oonaghmaryobrien-blog · 7 years ago
Text
Assignment 4 Data Analysis and Visualisation
Code :-
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Thu Jun 21 11:22:34 2018
@author: oonagh.obrien
"""
#Import libraries for doing data manipulation and statistical functions
import pandas
import numpy
import seaborn
import matplotlib.pyplot as plt
# Read in outlook on life dataset 2012 to data variable
data = pandas.read_csv('ool_pds.csv', low_memory=False)
#Set PANDAS to show all columns in DataFrame
pandas.set_option('display.max_columns', None)
#Set PANDAS to show all rows in DataFrame
pandas.set_option('display.max_rows', None)
# Create new view of data frame including:
# PPNET  - has internet access
# W1_A11 - number of times per week national news watched on TV or internet
# W1_D9  - Hillary Clinton popularity rating
sub1 = data[['W1_A11', 'PPNET', 'W1_D9']]
#print('Description of 3 variables \n')
#print(sub1.describe())
#Make a copy of my new subsetted data
sub2 = sub1.copy()
# Convert data to numeric
sub2['W1_A11'] = pandas.to_numeric(sub2['W1_A11'])
sub2['PPNET'] = pandas.to_numeric(sub2['PPNET'])
sub2['W1_D9'] = pandas.to_numeric(sub2['W1_D9'])
#print ('\nFrequency table for original W1_A11 - how many times watch National News')
c1 = sub2['W1_A11'].value_counts(sort=False, dropna=False)
#print(c1)
# Recode missing values (-1, refused) to Python missing (NaN)
sub2['W1_A11'] = sub2['W1_A11'].replace(-1, numpy.nan)
#print ('\nFrequency table for W1_A11 after recoding missing values to NaN')
c1 = sub2['W1_A11'].value_counts(sort=False, dropna=False)
#print(c1)
#Recode values for W1_A11 (1..8) into a new variable, TIMES_WATCH_NAT_NEWS (0..7)
recode1 = {1: 0, 2: 1, 3: 2, 4: 3, 5: 4, 6: 5, 7: 6, 8: 7, -1: 0}
sub2['TIMES_WATCH_NAT_NEWS'] = sub2['W1_A11'].map(recode1)
#Univariate bar graph for categorical variables
# First change format from numeric to categorical
sub2['TIMES_WATCH_NAT_NEWS'] = sub2['TIMES_WATCH_NAT_NEWS'].astype('category')
seaborn.countplot(x='TIMES_WATCH_NAT_NEWS', data=sub2)
plt.xlabel('Number of times watch national news per week')
plt.ylabel('Number of responses')
plt.title('Univariate Graph of Number of times watch national news per week')
fig = plt.gcf()
fig.savefig('TIMES_WATCH_NAT_NEWS')
plt.show()
#Recode values for W1_A11 into a new variable, WATCH_NAT_NEWS -
#either does (1) or does not (0) watch news
recode1 = {1: 0, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, -1: 0}
sub2['WATCH_NAT_NEWS'] = sub2['W1_A11'].map(recode1)
#print ('\nFrequency table for new recoded variable WATCH_NAT_NEWS')
c1 = sub2['WATCH_NAT_NEWS'].value_counts(sort=False, dropna=False)
#print(c1)
sub2['HAS_INTERNET'] = sub2['PPNET']
#print ('\nFrequency table for new variable HAS_INTERNET - has internet or not')
c1 = sub2['HAS_INTERNET'].value_counts(sort=False, dropna=False)
#print(c1)
print('\nFrequency table for original variable PPNET')
print('- has internet or not')
print('counts for original PPNET')
c1 = sub2['PPNET'].value_counts(sort=False, dropna=False)
print(c1)
# No missing values - boolean value true or false provided for all
#Univariate bar graph for categorical variables
# First change format from numeric to categorical
sub2['PPNET'] = sub2['PPNET'].astype('category')
seaborn.countplot(x='PPNET', data=sub2)
plt.xlabel('Has internet or not - 0 indicates no, 1 indicates yes')
plt.ylabel('Number of responses')
plt.title('Univariate Graph of Internet access or not')
fig = plt.gcf()
fig.savefig('HASINTERNET')
plt.show()
#print ('\nFrequency table for W1_D9 - Hillary Clinton popularity')
c1 = sub2['W1_D9'].value_counts(sort=False, dropna=False)
#print(c1)
# Recode missing values (-1 refused, 998 don't recognise) to NaN
sub2['W1_D9'] = sub2['W1_D9'].replace(-1, numpy.nan)
sub2['W1_D9'] = sub2['W1_D9'].replace(998, numpy.nan)
#print ('\nFrequency table for W1_D9 after missing values -1 and 998 converted to NaN')
c1 = sub2['W1_D9'].value_counts(sort=False, dropna=False)
#print(c1)
#Univariate histogram for quantitative variable:
seaborn.distplot(sub2['W1_D9'].dropna(), kde=False)
plt.xlabel('Popularity of Hillary Clinton')
plt.ylabel('Number of responses')
plt.title('Univariate Graph of Popularity of Hillary Clinton')
fig = plt.gcf()
fig.savefig('HILLARYPOPULARITY')
plt.show()
# Bivariate bar graph C->Q: times watch national news explanatory,
# Hillary Clinton popularity response
seaborn.factorplot(x='TIMES_WATCH_NAT_NEWS', y='W1_D9', data=sub2, kind='bar', ci=None)
plt.title('Bivariate graph of Hillary Popularity as response to Times watch National News from life dataset 2012 study')
plt.xlabel('Times watch national news')
plt.ylabel('Hillary Popularity')
fig = plt.gcf()
fig.savefig('HILLARYTIMES')
plt.show()
# Bivariate bar graph C->Q: whether watch national news explanatory
seaborn.factorplot(x='WATCH_NAT_NEWS', y='W1_D9', data=sub2, kind='bar', ci=None)
plt.title('Bivariate graph of Hillary Popularity as response to whether watch national news from life dataset 2012 study')
plt.xlabel('Watch national news - 0 indicates no, 1 indicates yes')
plt.ylabel('Hillary Popularity')
fig = plt.gcf()
fig.savefig('HILLARYNEWS')
plt.show()
# Bivariate bar graph C->Q: internet access explanatory
seaborn.factorplot(x='HAS_INTERNET', y='W1_D9', data=sub2, kind='bar', ci=None)
plt.title('Bivariate graph of Hillary Popularity as response to Internet Access from life dataset 2012 study')
plt.xlabel('Has Internet - 0 indicates no, 1 indicates yes')
plt.ylabel('Hillary Popularity')
fig = plt.gcf()
fig.savefig('HILLARYINTERNET')
plt.show()
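One behaviour the recodes above rely on: pandas' Series.map with a dict sends any value missing from the dict to NaN, which is why -1 needs an explicit entry whenever it should not become missing. A minimal sketch with made-up values:

```python
import pandas as pd

s = pd.Series([1, 2, 8, -1])
recode = {1: 0, 2: 1, 8: 7}      # deliberately no entry for -1
out = s.map(recode)
print(out.tolist())              # the unmapped -1 has become NaN
```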
The univariate graph of the number of times national news is watched per week:
[Figure: univariate bar chart of number of times national news is watched per week]
This graph is unimodal, with its highest peak at watching national news 0 times per week; the second highest peak is at 7 times per week. The distribution appears skewed to the right, as the lower categories have higher frequencies than the higher categories, with the exception of the final category, 7 times per week, which is the second highest.
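The visual impression of right-skew can be checked numerically with pandas' sample skewness, where a positive value indicates a longer right tail. A toy sketch (the real check would use sub2['TIMES_WATCH_NAT_NEWS'] as a numeric series; these counts are made up):

```python
import pandas as pd

# toy values standing in for the times-per-week column
times = pd.Series([0, 0, 0, 0, 1, 1, 2, 2, 3, 7, 7])
print(times.skew())  # positive => longer right tail, i.e. right-skewed
```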
The univariate graph of the categorical variable internet access.
[Figure: univariate bar chart of internet access or not]
This graph is unimodal, with its highest peak at having internet access: approximately 1,750 of 2,250 respondents have internet access.
The univariate graph of the quantitative variable Hillary Clinton popularity:
[Figure: univariate histogram of Hillary Clinton popularity]
This graph is unimodal, with its highest peak at approximately 85% popularity, with around 460 respondents. The distribution appears skewed to the left, as there are more responses at the higher popularity values.
Bivariate Graph of Number of times watch National News with Hillary Clinton Popularity
[Figure: bivariate bar chart of times watching national news per week v Hillary Clinton popularity]
The graph above plots the number of times national news is watched per week against the level of Hillary Clinton's popularity. The bar chart does not show a strong relationship between the two variables, although popularity does rise slightly as the number of times the news is watched increases.
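The bars in a C->Q factorplot are just the group means of the response within each category, so the trend can also be read off directly with groupby. A toy sketch with hypothetical columns 'times' and 'pop' standing in for the recoded news variable and W1_D9:

```python
import pandas as pd

df = pd.DataFrame({
    'times': [0, 0, 3, 7, 7],
    'pop':   [50, 55, 60, 70, 65],
})
means = df.groupby('times')['pop'].mean()
print(means)  # each bar in the factorplot is one of these per-category means
```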
Assignment 4
Code :-
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Fri Jul 20 07:53:09 2018
@author: oonagh.obrien
"""
# -*- coding: utf-8 -*-
"""
Created on Mon Sep 21 10:18:43 2015
@author: jml
"""
# ANOVA
import numpy
import pandas
import statsmodels.formula.api as smf
import seaborn
import matplotlib.pyplot as plt
# Read in outlook on life dataset 2012 to data variable
# Check how many rows and columns in data
data = pandas.read_csv('ool_pds.csv', low_memory=False)
#Set PANDAS to show all columns in DataFrame
pandas.set_option('display.max_columns', None)
#Set PANDAS to show all rows in DataFrame
pandas.set_option('display.max_rows', None)
# create new view of data frame with nulls removed
# include
# W1_D9 Hillary Clinton Popularity
# W1_C1 - Republican (1), Democrat (2), Independent (3), Something Else (4), Refused (-1)
# W1_B4 - Extremely Angry (1), Very Angry (2), Somewhat Angry (3),
# A little Angry (4), Not Angry at all (5), Refused (-1)
sub1 = data[['W1_C1','W1_B4','W1_D9']]
#make a copy of my new subsetted data
sub2 = sub1.copy()
# Convert data to numeric
sub2['W1_C1']= pandas.to_numeric(sub2['W1_C1'])
sub2['W1_B4']= pandas.to_numeric(sub2['W1_B4'])
sub2['W1_D9']= pandas.to_numeric(sub2['W1_D9'])
#print ('\nFrequency table for original W1_C1 - Political allegiance')
sub2['W1_C1']=sub2['W1_C1'].replace(-1, numpy.nan)
c1 = sub2['W1_C1'].value_counts(sort=False, dropna=False)
#print(c1)
# W1_D9 - Hillary Clinton Popularity
# recode missing values -1 refused or 998 don't recognise to python missing (NaN)
sub2['W1_D9']=sub2['W1_D9'].replace(-1, numpy.nan)
sub2['W1_D9']=sub2['W1_D9'].replace(998, numpy.nan)
#print ('\nFrequency table for  variable W1_D9')
#print('- Hillary Clinton popularity after missing values -1 and 998 converted to Nan')
#print ('\nFrequency table for original W1_D9 - Hillary Clinton Popularity')
c1 = sub2['W1_D9'].value_counts(sort=False, dropna=False)
#print(c1)
#print ('\nFrequency table for original W1_B4 - Level of anger')
#Recode so that an increasing number means increasing anger
#(originally an increasing number meant decreasing anger)
recode1 = {1: 5, 2: 4, 3: 3, 4: 2, 5: 1}
sub2['W1_B4']= sub2['W1_B4'].map(recode1)
sub2['W1_B4']=sub2['W1_B4'].replace(-1, numpy.nan)
c1 = sub2['W1_B4'].value_counts(sort=False, dropna=False)
#print(c1)
#Anova for popularity and political allegiance
model1 = smf.ols(formula='W1_D9 ~ C(W1_C1)', data=sub2).fit()
print (model1.summary())
#Anova for popularity and level of anger
model1 = smf.ols(formula='W1_D9 ~ C(W1_B4)', data=sub2).fit()
#print (model1.summary())
groupdata1 = sub2[['W1_D9','W1_B4']].dropna()
groupdata2 = sub2[['W1_D9','W1_C1']].dropna()
#Output means for level of anger
print ('Means for level of anger')
m1= groupdata1.groupby('W1_B4').mean()
#print (m1)
#Output std dev for level of anger
print ('Std Dev for level of anger')
m1= groupdata1.groupby('W1_B4').std()
#print (m1)
#Output means for political affiliation
print ('Means for political affiliation')
m1= groupdata2.groupby('W1_C1').mean()
print (m1)
#Output std dev for political affiliation
print ('Std Dev for political affiliation')
m1= groupdata2.groupby('W1_C1').std()
print (m1)
#seaborn.factorplot(x="W1_B4", y="W1_D9", data=groupdata1, kind="bar", ci=None)
#plt.xlabel('Level of Anger')
#plt.ylabel('Hillary popularity')
#plt.title('Level of Anger v Hillary C popularity')
seaborn.factorplot(x="W1_C1", y="W1_D9", data=groupdata2, kind="bar", ci=None)
plt.xlabel('Political Affiliation')
plt.ylabel('Hillary popularity')
plt.title('Political Affiliation v Hillary C popularity')
plt.show()
# Consider moderator of level of anger - subgroup data depending on
# level of anger
# Not angry at all
grp1=sub2[(sub2['W1_B4']==1)]
# A little angry
grp2=sub2[(sub2['W1_B4']==2)]
# Somewhat angry
grp3=sub2[(sub2['W1_B4']==3)]
# Very Angry
grp4=sub2[(sub2['W1_B4']==4)]
# Extremely Angry
grp5=sub2[(sub2['W1_B4']==5)]
print ('association between Political affiliation and HC popularity for Not angry')
model1 = smf.ols(formula='W1_D9 ~ C(W1_C1)', data=grp1).fit()
print (model1.summary())
print ('association between Political affiliation and HC popularity for A little angry')
model2 = smf.ols(formula='W1_D9 ~ C(W1_C1)', data=grp2).fit()
print (model2.summary())
print ('association between Political affiliation and HC popularity for Somewhat angry')
model3 = smf.ols(formula='W1_D9 ~ C(W1_C1)', data=grp3).fit()
print (model3.summary())
print ('association between Political affiliation and HC popularity for Very angry')
model4 = smf.ols(formula='W1_D9 ~ C(W1_C1)', data=grp4).fit()
print (model4.summary())
print ('association between Political affiliation and HC popularity for Extremely angry')
model5 = smf.ols(formula='W1_D9 ~ C(W1_C1)', data=grp5).fit()
print (model5.summary())
print ("means for Popularity by Political affiliation  Not angry")
m1= grp1.groupby('W1_C1').mean()
print (m1)
print ("means for Popularity by Political affiliation  A little angry")
m2= grp2.groupby('W1_C1').mean()
print (m2)
print ("means for Popularity by Political affiliation  Somewhat angry")
m3= grp3.groupby('W1_C1').mean()
print (m3)
print ("means for Popularity by Political affiliation  Very angry")
m4= grp4.groupby('W1_C1').mean()
print (m4)
print ("means for Popularity by Political affiliation  Extremely angry")
m5= grp5.groupby('W1_C1').mean()
print (m5)
print()
print()
seaborn.factorplot(x="W1_C1", y="W1_D9", data=grp1, kind="bar", ci=None)
plt.xlabel('Political Affiliation')
plt.ylabel('Hillary popularity')
plt.title('Political Affiliation v Hillary C popularity not angry')
seaborn.factorplot(x="W1_C1", y="W1_D9", data=grp2, kind="bar", ci=None)
plt.xlabel('Political Affiliation')
plt.ylabel('Hillary popularity')
plt.title('Political Affiliation v Hillary C popularity a little angry')
seaborn.factorplot(x="W1_C1", y="W1_D9", data=grp3, kind="bar", ci=None)
plt.xlabel('Political Affiliation')
plt.ylabel('Hillary popularity')
plt.title('Political Affiliation v Hillary C popularity somewhat angry')
seaborn.factorplot(x="W1_C1", y="W1_D9", data=grp4, kind="bar", ci=None)
plt.xlabel('Political Affiliation')
plt.ylabel('Hillary popularity')
plt.title('Political Affiliation v Hillary C popularity very angry')
seaborn.factorplot(x="W1_C1", y="W1_D9", data=grp5, kind="bar", ci=None)
plt.xlabel('Political Affiliation')
plt.ylabel('Hillary popularity')
plt.title('Political Affiliation v Hillary C popularity extremely angry')
---------------------------------
Output :-
[Figure: bar chart of political affiliation v Hillary Clinton popularity, all respondents]
                           OLS Regression Results                            
==============================================================================
Dep. Variable:                  W1_D9   R-squared:                       0.271
Model:                            OLS   Adj. R-squared:                  0.270
Method:                 Least Squares   F-statistic:                     266.3
Date:                Sat, 21 Jul 2018   Prob (F-statistic):          6.24e-147
Time:                        10:33:39   Log-Likelihood:                -9906.3
No. Observations:                2154   AIC:                         1.982e+04
Df Residuals:                    2150   BIC:                         1.984e+04
Df Model:                           3                                        
Covariance Type:            nonrobust                                        
===================================================================================
                     coef    std err          t     P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept          39.2125      1.346    29.140      0.000      36.574      41.851
C(W1_C1)[T.2.0]    39.3153     1.514     25.976      0.000     36.347      42.283
C(W1_C1)[T.3.0]    19.8492     1.701     11.668      0.000     16.513      23.185
C(W1_C1)[T.4.0]    12.3853      2.848      4.349      0.000       6.801      17.970
==============================================================================
Omnibus:                      152.720   Durbin-Watson:                   1.964
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              186.379
Skew:                          -0.677   Prob(JB):                     3.37e-41
Kurtosis:                       3.495   Cond. No.                         7.39
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
association between Political affiliation and HC popularity for Not angry
                           OLS Regression Results                            
==============================================================================
Dep. Variable:                  W1_D9   R-squared:                       0.104
Model:                            OLS   Adj. R-squared:                  0.094
Method:                 Least Squares   F-statistic:                     10.15
Date:                Sat, 21 Jul 2018   Prob (F-statistic):           2.38e-06
Time:                        10:33:39   Log-Likelihood:                -1212.6
No. Observations:                 267   AIC:                             2433.
Df Residuals:                     263   BIC:                             2448.
Df Model:                           3                                        
Covariance Type:            nonrobust                                        
===================================================================================
                     coef    std err          t     P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept          55.3125      5.719     9.672      0.000      44.052      66.573
C(W1_C1)[T.2.0]    24.3249     5.981      4.067      0.000     12.549      36.101
C(W1_C1)[T.3.0]    11.7036     6.415      1.825      0.069     -0.927      24.334
C(W1_C1)[T.4.0]     8.4097     7.860      1.070      0.286     -7.067      23.886
==============================================================================
Omnibus:                       47.995   Durbin-Watson:                   1.996
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               72.160
Skew:                          -1.074   Prob(JB):                     2.14e-16
Kurtosis:                       4.368   Cond. No.                         10.5
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
association between Political affiliation and HC popularity for A little angry
                           OLS Regression Results                            
==============================================================================
Dep. Variable:                  W1_D9   R-squared:                       0.191
Model:                            OLS   Adj. R-squared:                  0.187
Method:                 Least Squares   F-statistic:                     41.70
Date:                Sat, 21 Jul 2018   Prob (F-statistic):           3.36e-24
Time:                        10:33:39   Log-Likelihood:                -2386.0
No. Observations:                 533   AIC:                             4780.
Df Residuals:                     529   BIC:                             4797.
Df Model:                           3                                        
Covariance Type:            nonrobust                                        
===================================================================================
                     coef    std err          t     P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept          44.8333      3.296    13.603      0.000      38.359      51.308
C(W1_C1)[T.2.0]    33.5203     3.482      9.628      0.000     26.681      40.360
C(W1_C1)[T.3.0]    17.3363     3.865      4.486      0.000       9.744      24.928
C(W1_C1)[T.4.0]    21.0490     6.140      3.428      0.001       8.988      33.110
==============================================================================
Omnibus:                       63.753   Durbin-Watson:                   2.073
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               87.388
Skew:                          -0.864   Prob(JB):                     1.06e-19
Kurtosis:                       3.974   Cond. No.                        10.0
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
association between Political affiliation and HC popularity for Somewhat angry
                           OLS Regression Results                            
==============================================================================
Dep. Variable:                  W1_D9   R-squared:                       0.282
Model:                            OLS   Adj. R-squared:                  0.278
Method:                 Least Squares   F-statistic:                     89.63
Date:                Sat, 21 Jul 2018   Prob (F-statistic):           6.07e-49
Time:                        10:33:39   Log-Likelihood:                -3099.6
No. Observations:                 690   AIC:                             6207.
Df Residuals:                     686   BIC:                             6225.
Df Model:                           3                                        
Covariance Type:            nonrobust                                        
===================================================================================
                     coef    std err         t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept          40.5978      2.260    17.965      0.000      36.161      45.035
C(W1_C1)[T.2.0]    37.8421     2.497     15.154      0.000     32.939      42.745
C(W1_C1)[T.3.0]    20.0949     2.817      7.133      0.000     14.564      25.626
C(W1_C1)[T.4.0]    13.1522     5.871      2.240      0.025       1.625      24.680
==============================================================================
Omnibus:                       84.449   Durbin-Watson:                   1.932
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              126.149
Skew:                          -0.843   Prob(JB):                     4.05e-28
Kurtosis:                       4.244   Cond. No.                         9.06
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
association between Political affiliation and HC popularity for Very angry
                           OLS Regression Results                            
==============================================================================
Dep. Variable:                  W1_D9   R-squared:                       0.231
Model:                            OLS   Adj. R-squared:                  0.226
Method:                 Least Squares   F-statistic:                     41.42
Date:                Sat, 21 Jul 2018   Prob (F-statistic):           2.04e-23
Time:                        10:33:39   Log-Likelihood:                -1951.3
No. Observations:                 417   AIC:                             3911.
Df Residuals:                     413   BIC:                             3927.
Df Model:                           3                                        
Covariance Type:            nonrobust                                        
===================================================================================
                     coef    std err          t     P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept          41.8913      2.730    15.343      0.000      36.524      47.258
C(W1_C1)[T.2.0]    35.3237     3.338     10.582      0.000     28.762      41.885
C(W1_C1)[T.3.0]    15.4104     3.656      4.215      0.000       8.224      22.597
C(W1_C1)[T.4.0]     9.7174     6.105      1.592      0.112     -2.284      21.719
==============================================================================
Omnibus:                       28.366   Durbin-Watson:                   1.755
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               32.526
Skew:                          -0.680   Prob(JB):                     8.65e-08
Kurtosis:                       3.147   Cond. No.                         6.03
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
association between Political affiliation and HC popularity for Extremely angry
                           OLS Regression Results                            
==============================================================================
Dep. Variable:                  W1_D9   R-squared:                       0.343
Model:                            OLS   Adj. R-squared:                  0.334
Method:                 Least Squares   F-statistic:                     40.32
Date:                Sat, 21 Jul 2018   Prob (F-statistic):           5.21e-21
Time:                        10:33:39   Log-Likelihood:                -1131.3
No. Observations:                 236   AIC:                             2271.
Df Residuals:                     232   BIC:                             2284.
Df Model:                           3                                        
Covariance Type:            nonrobust                                        
===================================================================================
                     coef    std err          t     P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept          28.0800      3.402     8.253      0.000      21.377      34.783
C(W1_C1)[T.2.0]    51.2836     4.973     10.313      0.000     41.486      61.081
C(W1_C1)[T.3.0]    19.1668     4.780      4.010      0.000       9.749      28.585
C(W1_C1)[T.4.0]    -4.0244     7.733     -0.520      0.603    -19.261      11.212
==============================================================================
Omnibus:                       10.111   Durbin-Watson:                   2.182
Prob(Omnibus):                  0.006   Jarque-Bera (JB):                5.862
Skew:                           0.205   Prob(JB):                       0.0533
Kurtosis:                       2.346   Cond. No.                         4.91
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
means for Popularity by Political affiliation  Not angry
     W1_B4      W1_D9
W1_C1                  
1.0      1.0 55.312500
2.0     1.0  79.637427
3.0     1.0  67.016129
4.0     1.0  63.722222
means for Popularity by Political affiliation  A little angry
     W1_B4      W1_D9
W1_C1                  
1.0     2.0  44.833333
2.0     2.0  78.353591
3.0     2.0  62.169643
4.0     2.0  65.882353
means for Popularity by Political affiliation  Somewhat angry
     W1_B4      W1_D9
W1_C1                  
1.0     3.0  40.597826
2.0     3.0  78.439904
3.0     3.0  60.692771
4.0     3.0  53.750000
means for Popularity by Political affiliation  Very angry
     W1_B4      W1_D9
W1_C1                  
1.0     4.0  41.891304
2.0     4.0  77.215054
3.0     4.0  57.301724
4.0     4.0  51.608696
means for Popularity by Political affiliation  Extremely angry
      W1_B4     W1_D9
W1_C1                  
1.0     5.0  28.080000
2.0     5.0  79.363636
3.0     5.0  47.246753
4.0     5.0  24.055556
[Figures: political affiliation v Hillary Clinton popularity bar charts for each anger subgroup]
Political Affiliation v Hillary Clinton Popularity for respondents who are extremely angry
Analysis :-
The ANOVA examining the association between Hillary Clinton's popularity rating and political affiliation gives a very small p-value and a large F-statistic, indicating a statistically significant relationship between the two variables.
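A significant overall F-test with four affiliation categories does not say which pairs of groups differ; a post-hoc comparison such as Tukey's HSD can. A sketch on simulated data: 'party' and 'pop' are hypothetical stand-ins for W1_C1 and W1_D9, with group means chosen to loosely echo the fitted means in the output above (intercept plus coefficients), not real survey data.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
# four affiliation codes, 30 simulated respondents each
df = pd.DataFrame({
    'party': np.repeat([1, 2, 3, 4], 30),
    'pop': np.concatenate([rng.normal(m, 5, 30) for m in (40, 78, 60, 52)]),
})
# pairwise Tukey HSD comparisons of mean popularity across affiliations
tuk = pairwise_tukeyhsd(endog=df['pop'], groups=df['party'], alpha=0.05)
print(tuk.summary())
```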
Further examination of the means and standard deviations, and graphing of the association, shows that popularity is highest when political affiliation is Democrat and lowest when it is Republican.
The variable W1_B4, 'level of anger', was considered as a possible moderator of the association between political affiliation and popularity of Hillary Clinton. This was investigated by making a subgroup for each level of anger: not angry, a little angry, somewhat angry, very angry, and extremely angry.
An ANOVA for the relationship between political affiliation and popularity of Hillary Clinton was then run within each subgroup. Every subgroup had a very small p-value and a large F-statistic, indicating a statistically significant relationship between the two variables.
From examining the means and standard deviations and graphing the relationship between political affiliation and popularity of Hillary Clinton for each subgroup, it appears that level of anger is a moderator of the association, with popularity decreasing among non-Democrats as the level of anger increases.
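Splitting into five subgroups fits five separate models; the same moderation question can also be asked in one model with an interaction term, since moderation means the party effect changes with anger level. A minimal sketch on simulated data ('party', 'anger', 'pop' are hypothetical stand-ins for W1_C1, W1_B4, W1_D9, with an interaction built in by construction):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 600
party = rng.integers(1, 5, n)      # stand-in for W1_C1
anger = rng.integers(1, 6, n)      # stand-in for W1_B4
# simulate a response whose party effect changes with anger level
pop = (60 + 10 * (party == 2) * anger
          - 5 * (party == 1) * anger
          + rng.normal(0, 5, n))
df = pd.DataFrame({'party': party, 'anger': anger, 'pop': pop})

additive = smf.ols('pop ~ C(party) + C(anger)', data=df).fit()
moderated = smf.ols('pop ~ C(party) * C(anger)', data=df).fit()
# a clearly higher R-squared for the interaction model suggests moderation
print(additive.rsquared, moderated.rsquared)
```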
Generating the correlation coefficient to determine the correlation between the number of times national news is watched per week and Hillary Clinton's popularity
------------------------------
CODE
-------------------------------
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Mon Jun 25 08:54:30 2018
@author: oonagh.obrien
"""
#Import libraries for doing data manipulation and statistical functions
import pandas
import numpy
import seaborn
import scipy.stats
import matplotlib.pyplot as plt
# Read in outlook on life dataset 2012 to data variable
data = pandas.read_csv('ool_pds.csv', low_memory=False)
# Create new view of data frame including the quantitative variables:
# W1_A11 - number of times per week national news watched on TV or internet
# W1_D9  - Hillary Clinton popularity rating
sub1 = data[['W1_A11', 'W1_D9']]
print('Description of 2 variables \n')
print(sub1.describe())
#Make a copy of my new subsetted data
sub2 = sub1.copy()
# Convert data to numeric
sub2['W1_A11'] = pandas.to_numeric(sub2['W1_A11'])
sub2['W1_D9'] = pandas.to_numeric(sub2['W1_D9'])
print('\nFrequency table for original W1_A11 - how many times watch National News')
c1 = sub2['W1_A11'].value_counts(sort=False, dropna=False)
print(c1)
# Recode missing values (-1, refused) to Python missing (NaN)
sub2['W1_A11'] = sub2['W1_A11'].replace(-1, numpy.nan)
print('\nFrequency table for W1_A11 after recoding missing values to NaN')
#Recode values for times watched from 0..7 rather than 1..8
#as easier to interpret on a graph
recode1 = {1: 0, 2: 1, 3: 2, 4: 3, 5: 4, 6: 5, 7: 6, 8: 7}
sub2['W1_A11_RECODE'] = sub2['W1_A11'].map(recode1)
c1 = sub2['W1_A11_RECODE'].value_counts(sort=False, dropna=False)
print(c1)
print('\nFrequency table for W1_D9 - Hillary Clinton popularity')
c1 = sub2['W1_D9'].value_counts(sort=False, dropna=False)
print(c1)
# Recode missing values (-1 refused, 998 don't recognise) to NaN
sub2['W1_D9'] = sub2['W1_D9'].replace(-1, numpy.nan)
sub2['W1_D9'] = sub2['W1_D9'].replace(998, numpy.nan)
print('\nFrequency table for W1_D9 after missing values -1 and 998 converted to NaN')
c1 = sub2['W1_D9'].value_counts(sort=False, dropna=False)
print(c1)
print('\n\n')
# Scatterplot with fitted regression line
scat1 = seaborn.regplot(x='W1_A11_RECODE', y='W1_D9', fit_reg=True, data=sub2)
plt.xlabel('Times Watch National News')
plt.ylabel('Rating for Hillary Clinton popularity')
plt.title('Scatterplot for the Association Between \n Times per week watch National News and Hillary Clinton popularity')
data_clean = sub2.dropna()
print('Association Between Times per week watch National News and Hillary Clinton popularity')
print('Pearson coefficient | p-value')
print(scipy.stats.pearsonr(data_clean['W1_A11_RECODE'], data_clean['W1_D9']))
result = scipy.stats.pearsonr(data_clean['W1_A11_RECODE'], data_clean['W1_D9'])
print('\nR-squared or Coefficient of Determination')
print(result[0] * result[0])
--------------------
OUTPUT TO EXAMINE CORRELATION
--------------------
[Image: scatterplot of times per week watching national news vs. Hillary Clinton popularity rating]
Association Between Times per week watch National News and Hillary Clinton popularity
Pearson coefficient | p-value
(0.12865812149374659, 1.73963015355937e-09)
R-squared or Coefficient of Determination
0.016552912226299656
------------------------
ASSESSMENT OF OUTPUT
--------------------------
The correlation coefficient assesses the degree of linear relationship between two variables. It ranges from +1 to -1. A correlation of +1 means that there is a perfect, positive, linear relationship between the two variables. A correlation of -1 means there is a perfect, negative linear relationship between the two variables.
In my example I calculated the correlation coefficient to assess the degree of linear relationship between the number of times national news is watched each week (explanatory variable) and the popularity of Hillary Clinton (response variable).
The results were a Pearson coefficient of 0.12865812149374659 and a p-value of 1.73963015355937e-09.
The Pearson coefficient was very close to 0, which indicates a weak correlation between the number of times national news is watched each week and the popularity of Hillary Clinton. Although the p-value was very small, so the association is statistically significant, the weak correlation means the relationship has little practical importance.
The R-squared or Coefficient of Determination was 0.016552912226299656, again very small, indicating that only about 1.7% of the variability in Hillary Clinton's popularity can be explained by the number of times national news is watched per week.
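As a minimal sketch of how these statistics are obtained, the snippet below runs pearsonr on made-up paired observations (toy values, not the survey data):

```python
# Minimal sketch of computing a Pearson correlation and R-squared
# with scipy, using made-up paired observations (not the survey data).
import scipy.stats

watch = [0, 1, 2, 3, 4, 5, 6, 7]           # times news watched per week
rating = [50, 55, 52, 60, 58, 65, 63, 70]  # hypothetical 0-100 rating

r, p_value = scipy.stats.pearsonr(watch, rating)
r_squared = r * r  # share of rating variance explained by watch

print(r, p_value, r_squared)
```

On the toy data the correlation is strong; on the survey data above it is weak, which is why the small p-value alone is not enough to call the relationship important.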
oonaghmaryobrien-blog · 7 years ago
Data Management of variables on popularity, watching national news and internet access
-----------------
Code
------------------
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Mon Jun 25 08:54:30 2018

@author: oonagh.obrien
"""
#Import libraries for doing data manipulation and statistical functions
import pandas
import numpy

# Read in outlook on life dataset 2012 to data variable
# Check how many rows and columns in data
data = pandas.read_csv('ool_pds.csv', low_memory=False)
print('How many rows of data ')
print(len(data))
print('How many variables or columns of data ')
print (len(data.columns))

# Create a new view of the data frame with the three variables of interest:
# PPNET - has internet access,
# W1_A11 - number of times per week national news watched on tv or internet
# (categorical explanatory variable), and
# W1_D9 - Hillary Clinton popularity
# Reference this view of data with sub1
sub1 = data[['W1_A11','PPNET','W1_D9']]
print('Description of 3 variables \n')
print (sub1.describe())

# Make a copy of my new subsetted data
sub2 = sub1.copy()

# Convert data to numeric
sub2['W1_A11']= pandas.to_numeric(sub2['W1_A11'])
sub2['PPNET']= pandas.to_numeric(sub2['PPNET'])
sub2['W1_D9']= pandas.to_numeric(sub2['W1_D9'])

print ('\nFrequency table for original W1_A11 - how many times watch National News')
c1 = sub2['W1_A11'].value_counts(sort=False, dropna=False)
print(c1)

# Recode missing values (-1 refused) to python missing (NaN)
sub2['W1_A11']=sub2['W1_A11'].replace(-1, numpy.nan)
print ('\nFrequency table for original W1_A11 - how many times watch National News')
print('after recoding missing values to NaN')
c1 = sub2['W1_A11'].value_counts(sort=False, dropna=False)
print(c1)

# Recode values for W1_A11 into a new variable, WATCH_NAT_NEWS
# - either does or does not watch news
recode1 = {1: 0, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1}
sub2['WATCH_NAT_NEWS']= sub2['W1_A11'].map(recode1)
print ('\nFrequency table for new recoded variable WATCH_NAT_NEWS')
print ('- watched National News at least once in week (1) or not (0)')
c1 = sub2['WATCH_NAT_NEWS'].value_counts(sort=False, dropna=False)
print(c1)

sub2['HAS_INTERNET']=sub2['PPNET']
print ('\nFrequency table for new variable HAS_INTERNET')
print('- has internet or not')
c1 = sub2['HAS_INTERNET'].value_counts(sort=False, dropna=False)
print(c1)

print ('\nFrequency table for original variable PPNET')
print('- has internet or not')
print ('counts for original PPNET')
c1 = sub2['PPNET'].value_counts(sort=False, dropna=False)
print(c1)
# No missing values - boolean value true or false provided for all

print ('\nFrequency table for W1_D9')
print('- Hillary Clinton popularity')
print ('counts for original W1_D9')
c1 = sub2['W1_D9'].value_counts(sort=False, dropna=False)
print(c1)

# Recode missing values (-1 refused, 998 don't recognise) to python missing (NaN)
sub2['W1_D9']=sub2['W1_D9'].replace(-1, numpy.nan)
sub2['W1_D9']=sub2['W1_D9'].replace(998, numpy.nan)

print ('\nFrequency table for variable W1_D9')
print('- Hillary Clinton popularity after missing values -1 and 998 converted to NaN')
c1 = sub2['W1_D9'].value_counts(sort=False, dropna=False)
print(c1)
print('\n\n')

# Quartile split (use qcut function and ask for 4 groups - gives a quartile split)
print ('Popularity - 4 categories - quartiles')
sub2['POPULARITY4']=pandas.qcut(sub2.W1_D9, 4, labels=["1=0-25 Percentile","2=25-50 Percentile","3=50-75 Percentile","4=75-100 Percentile"])
c4 = sub2['POPULARITY4'].value_counts(sort=False, dropna=True)
print(c4)
print('\n\n')

# Crosstab evaluating which popularity values were put into which POPULARITY4 group
print (pandas.crosstab(sub2['POPULARITY4'], sub2['W1_D9']))
print('\n\n')

# Frequency distribution for POPULARITY4
print ('counts for POPULARITY4')
c10 = sub2['POPULARITY4'].value_counts(sort=False)
print(c10)
print ('\n\n')

print ('percentages for POPULARITY4')
p10 = sub2['POPULARITY4'].value_counts(sort=False, normalize=True)
print (p10)
print ('\n\n')
------------------
OUTPUT
------------------
How many rows of data
2294
How many variables or columns of data
436

Description of 3 variables

           W1_A11        PPNET        W1_D9
count  2294.000000  2294.000000  2294.000000
mean      4.190061     0.778989    90.412380
std       2.625410     0.415019   154.079581
min      -1.000000     0.000000    -1.000000
25%       2.000000     1.000000    50.000000
50%       4.000000     1.000000    70.000000
75%       7.000000     1.000000    85.000000
max       8.000000     1.000000   998.000000

Frequency table for original W1_A11 - how many times watch National News
 2    264
 4    245
 6    242
 8    463
-1     11
 1    516
 3    284
 5    146
 7    123
Name: W1_A11, dtype: int64

Frequency table for original W1_A11 - how many times watch National News
after recoding missing values to NaN
5.0    146
4.0    245
1.0    516
8.0    463
2.0    264
3.0    284
7.0    123
6.0    242
NaN     11
Name: W1_A11, dtype: int64

Frequency table for new recoded variable WATCH_NAT_NEWS
- watched National News at least once in week (1) or not (0)
1.0    1767
0.0     516
NaN      11
Name: WATCH_NAT_NEWS, dtype: int64

Frequency table for new variable HAS_INTERNET
- has internet or not
0     507
1    1787
Name: HAS_INTERNET, dtype: int64

Frequency table for original variable PPNET
- has internet or not
counts for original PPNET
0     507
1    1787
Name: PPNET, dtype: int64
Frequency table for  W1_D9 - Hillary Clinton popularity counts for original W1_D9 0      108 2        3 4        2 6        1 10       8 12       1 20       7 22       1 30      78 36       1 38       1 40      94 50     188 58       1 60     194 62       3 68       1 70     314 72       2 74       1 76       1 80      51 88       1 90      63 92       1 94       3 96       1 98       3 100    343 998     62 -1       51 3        2 5        7 7        1 15      94 25       7 29       1 35       5 37       1 45       4 55      11 57       1 59       2 65      21 69       3 75      48 79       2 85     456 87       3 89       2 91       1 93       1 95      27 97       1 99       4 Name: W1_D9, dtype: int64
Frequency table for  variable W1_D9 - Hillary Clinton popularity after missing values -1 and 998 converted to Nan 0.0      108 80.0      51 100.0    343 60.0     194 50.0     188 40.0      94 30.0      78 NaN       113 15.0      94 5.0        7 59.0       2 2.0        3 72.0       2 10.0       8 92.0       1 76.0       1 25.0       7 4.0        2 3.0        2 20.0       7 36.0       1 7.0        1 62.0       3 6.0        1 38.0       1 88.0       1 58.0       1 96.0       1 68.0       1 29.0       1 12.0       1 22.0       1 65.0      21 95.0      27 89.0       2 99.0       4 97.0       1 70.0     314 85.0     456 90.0      63 55.0      11 87.0       3 45.0       4 37.0       1 57.0       1 94.0       3 74.0       1 35.0       5 98.0       3 75.0      48 69.0       3 79.0       2 93.0       1 91.0       1 Name: W1_D9, dtype: int64
Popularity - 4 categories - quartiles
1=0-25 Percentile      615
2=25-50 Percentile     551
3=50-75 Percentile     561
4=75-100 Percentile    454
Name: POPULARITY4, dtype: int64
W1_D9                0.0  2.0  3.0  ...  98.0  99.0  100.0
POPULARITY4                         ...
1=0-25 Percentile    108    3    2  ...     0     0      0
3=50-75 Percentile     0    0    0  ...     0     0      0
2=25-50 Percentile     0    0    0  ...     0     0      0
4=75-100 Percentile    0    0    0  ...     3     4    343

[4 rows x 53 columns]
counts for POPULARITY4
1=0-25 Percentile      615
2=25-50 Percentile     551
3=50-75 Percentile     561
4=75-100 Percentile    454
Name: POPULARITY4, dtype: int64
percentages for POPULARITY4
1=0-25 Percentile      0.281981
2=25-50 Percentile     0.252636
3=50-75 Percentile     0.257221
4=75-100 Percentile    0.208161
Name: POPULARITY4, dtype: float64
------------------------------
SUMMARY
________________________
The distribution of the data for each of these variables is output, as well as the distribution after variable values have been adjusted to account for missing values.
The distributions of the newly created and recoded variables are also output.
The variable W1_A11 - how many times national news was watched in the week - takes values between -1 and 8, where -1 indicates a missing value. The most common response is 1 (news not watched at all, 516 respondents), followed by 8 (news watched daily, 463 respondents); there are 11 missing values.
The variable PPNET - internet access - has no missing values; 0 indicates no access and 1 indicates access. 1787 of the 2294 respondents have internet access.
The variable W1_D9 has values between -1 and 100; it also takes the value 998. The values -1 and 998 indicate missing values. Of the data available, 615 values fall in the 0-25th percentile band, 551 in the 25-50th, 561 in the 50-75th and 454 in the 75-100th.
Missing values were changed to python NaN.
The number of times national news was watched was recoded into a new variable indicating whether news was watched at least once in the week - boolean, 0 = no, 1 = yes.
A quartile split of popularity was made and the results output, with details of the quartile split in terms of counts and percentages.
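The recode-and-split steps described above can be sketched on a few made-up values (toy data, not the survey responses):

```python
# Minimal sketch: recode a counts variable to a yes/no flag with map,
# then quartile-split a 0-100 rating with pandas.qcut (toy data).
import pandas
import numpy

times_watched = pandas.Series([1, 3, 8, -1, 5])       # 1 = zero times, -1 = refused
times_watched = times_watched.replace(-1, numpy.nan)  # missing -> NaN
watched = times_watched.map({1: 0, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1})

ratings = pandas.Series([0, 15, 30, 40, 50, 60, 70, 85, 90, 100, None])
quartiles = pandas.qcut(
    ratings, 4,
    labels=["1=0-25 Percentile", "2=25-50 Percentile",
            "3=50-75 Percentile", "4=75-100 Percentile"])

print(watched.value_counts(dropna=False))
print(quartiles.value_counts(sort=False))  # NaN rating stays NaN, excluded here
```

qcut picks the bin edges from the data itself, so each quartile label covers roughly a quarter of the non-missing values, which is what the POPULARITY4 counts above show.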
oonaghmaryobrien-blog · 7 years ago
Data Management of variables on popularity, watching national news and internet access
Code 
------------------------------
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Mon Jun 25 08:54:30 2018

@author: oonagh.obrien
"""
#Import libraries for doing data manipulation and statistical functions
import pandas
import numpy

# Read in outlook on life dataset 2012 to data variable
# Check how many rows and columns in data
data = pandas.read_csv('ool_pds.csv', low_memory=False)
print('How many rows of data ')
print(len(data))
print('How many variables or columns of data ')
print (len(data.columns))

# Create a new view of the data frame with the three variables of interest:
# PPNET - has internet access,
# W1_A11 - number of times per week national news watched on tv or internet
# (categorical explanatory variable), and
# W1_D9 - Hillary Clinton popularity
# Reference this view of data with sub1
sub1 = data[['W1_A11','PPNET','W1_D9']]
print (sub1.describe())

# Make a copy of my new subsetted data
sub2 = sub1.copy()

# Convert data to numeric
sub2['W1_A11']= pandas.to_numeric(sub2['W1_A11'])
sub2['PPNET']= pandas.to_numeric(sub2['PPNET'])
sub2['W1_D9']= pandas.to_numeric(sub2['W1_D9'])

print ('counts for original W1_A11')
c1 = sub2['W1_A11'].value_counts(sort=False, dropna=False)
print(c1)

# Recode missing values (-1 refused) to python missing (NaN)
sub2['W1_A11']=sub2['W1_A11'].replace(-1, numpy.nan)

# Recode values for W1_A11 into a new variable, WATCH_NAT_NEWS
# - either does or does not watch news
recode1 = {1: 0, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1}
sub2['WATCH_NAT_NEWS']= sub2['W1_A11'].map(recode1)
sub2['HAS_INTERNET']=sub2['PPNET']

print ('counts for original PPNET')
c1 = sub2['PPNET'].value_counts(sort=False, dropna=False)
print(c1)
# No missing values - boolean value true or false provided for all

print ('counts for original W1_D9')
c1 = sub2['W1_D9'].value_counts(sort=False, dropna=False)
print(c1)

# Recode missing values (-1 refused, 998 don't recognise) to python missing (NaN)
sub2['W1_D9']=sub2['W1_D9'].replace(-1, numpy.nan)
sub2['W1_D9']=sub2['W1_D9'].replace(998, numpy.nan)

# Quartile split (use qcut function and ask for 4 groups - gives a quartile split)
print ('Popularity - 4 categories - quartiles')
sub2['POPULARITY4']=pandas.qcut(sub2.W1_D9, 4, labels=["1=0-25 Percentile","2=25-50 Percentile","3=50-75 Percentile","4=75-100 Percentile"])
c4 = sub2['POPULARITY4'].value_counts(sort=False, dropna=True)
print(c4)

# Crosstab evaluating which popularity values were put into which POPULARITY4 group
print (pandas.crosstab(sub2['POPULARITY4'], sub2['W1_D9']))

# Frequency distribution for POPULARITY4
print ('counts for POPULARITY4')
c10 = sub2['POPULARITY4'].value_counts(sort=False)
print(c10)
print ('\n\n')

print ('percentages for POPULARITY4')
p10 = sub2['POPULARITY4'].value_counts(sort=False, normalize=True)
print (p10)
print ('\n\n')
Output 
------------------------------------------------------------------------
How many rows of data 2294 How many variables or columns of data 436            W1_A11        PPNET        W1_D9 count  2294.000000  2294.000000  2294.000000 mean      4.190061     0.778989    90.412380 std       2.625410     0.415019   154.079581 min      -1.000000     0.000000    -1.000000 25%       2.000000     1.000000    50.000000 50%       4.000000     1.000000    70.000000 75%       7.000000     1.000000    85.000000 max       8.000000     1.000000   998.000000 counts for original W1_A11 2    264 4    245 6    242 8    463 -1     11 1    516 3    284 5    146 7    123 Name: W1_A11, dtype: int64 counts for original PPNET 0     507 1    1787 Name: PPNET, dtype: int64 counts for original W1_D9 0      108 2        3 4        2 6        1 10       8 12       1 20       7 22       1 30      78 36       1 38       1 40      94 50     188 58       1 60     194 62       3 68       1 70     314 72       2 74       1 76       1 80      51 88       1 90      63 92       1 94       3 96       1 98       3 100    343 998     62 -1       51 3        2 5        7 7        1 15      94 25       7 29       1 35       5 37       1 45       4 55      11 57       1 59       2 65      21 69       3 75      48 79       2 85     456 87       3 89       2 91       1 93       1 95      27 97       1 99       4 Name: W1_D9, dtype: int64 Popularity - 4 categories - quartiles 1=0-25 Percentile      615 2=25-50 Percentile     551 3=50-75 Percentile     561 4=75-100 Percentile    454 Name: POPULARITY4, dtype: int64 W1_D9                0.0    2.0    3.0    ...    98.0   99.0   100.0 POPULARITY4                               ...                       1=0-25 Percentile      108      3      2  ...        0      0      0 3=50-75 Percentile       0      0      0  ...        0      0      0 2=25-50 Percentile       0      0      0  ...        0      0      0 4=75-100 Percentile      0      0      0  ...        3      4    343
[4 rows x 53 columns]

counts for POPULARITY4
1=0-25 Percentile      615
2=25-50 Percentile     551
3=50-75 Percentile     561
4=75-100 Percentile    454
Name: POPULARITY4, dtype: int64
percentages for POPULARITY4
1=0-25 Percentile      0.281981
2=25-50 Percentile     0.252636
3=50-75 Percentile     0.257221
4=75-100 Percentile    0.208161
Name: POPULARITY4, dtype: float64
--------------------------------------
Read in data set - see how many rows of data and columns/variables
Create a copy of the data which holds the three variables of interest
Look at the distribution of the data for each of these variables.
The variable W1_A11 - how many times national news was watched in the week - takes values between -1 and 8, where -1 indicates a missing value. The most common response is 1 (news not watched at all, 516 respondents), followed by 8 (news watched daily, 463 respondents); there are 11 missing values.
The variable PPNET - internet access - has no missing values; 0 indicates no access and 1 indicates access. 1787 of the 2294 respondents have internet access.
The variable W1_D9 has values between -1 and 100; it also takes the value 998. The values -1 and 998 indicate missing values. Of the data available, 615 values fall in the 0-25th percentile band, 551 in the 25-50th, 561 in the 50-75th and 454 in the 75-100th.
Missing values were changed to python NaN.
The number of times national news was watched was recoded into a new variable indicating whether news was watched at least once in the week - boolean, 0 = no, 1 = yes.
A quartile split of popularity was made and the results output, with details of the quartile split in terms of counts and percentages.
oonaghmaryobrien-blog · 7 years ago
Data Analysis Tools Assignment 2
Explanation :-
Using data from the Outlook on Life 2012 survey, determine whether there is a statistically significant link between watching national news and having internet access: are the two variables independent or dependent?
Is the rate of watching national news the same for those who do and do not have internet access?
Variables used from the Outlook on Life 2012 survey:
PPNET : household internet access (0 = no, 1 = yes)
W1_A11 : How many days last week did you watch national news on the television or internet
(1 = none, 2 = one day, ... 8 = seven days; -1 = refused)
Ho: Having internet access does not affect whether you watch national news
Ha: Having internet access affects whether you watch national news
1: Changed the quantitative variable (number of times per week watched national news) to a categorical variable (did or did not watch national news)
2: Ran a chi-squared test of independence
3: Graphed the results - a post hoc test was not necessary as both variables have only two values
Since the p-value (0.0005) is less than the significance level (0.05), we reject the null hypothesis: there is a statistically significant link between having internet access and watching national news.
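A minimal sketch of the chi-square test of independence, run directly on the 2x2 table of observed counts reported in the output (scipy applies the Yates continuity correction to a 2x2 table by default):

```python
# Minimal sketch: chi-square test of independence on the 2x2 table of
# observed counts (rows: watch national news no/yes; cols: internet no/yes).
import scipy.stats

observed = [[146, 381],    # did not watch national news
            [361, 1406]]   # watched national news at least once

chi2, p, dof, expected = scipy.stats.chi2_contingency(observed)
print(chi2, p, dof)  # matches the chi-square and p-value reported below
```

The expected-count array returned here is what the observed counts are compared against: the larger the gap between observed and expected, the larger the chi-square statistic.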
------------------------------------------------------
Code :-
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Mon Jun 25 08:54:30 2018

@author: oonagh.obrien
"""
# Import libraries for doing data manipulation and statistical functions
import pandas
import scipy.stats as scipy
import seaborn
import matplotlib.pyplot as plt

# Read in outlook on life dataset 2012 to data variable
# Check how many rows and columns in data
data = pandas.read_csv('ool_pds.csv', low_memory=False)
print(len(data))
print (len(data.columns))

# Create a new view of the data frame with nulls removed, including
# PPNET - has internet access - and
# W1_A11 - number of times per week national news watched on tv or internet
# (categorical explanatory variable)
# Reference this view of data with sub1
sub1 = data[['W1_A11','PPNET']].dropna()
print (sub1.describe())

# Convert data to numeric
sub1['W1_A11']= pandas.to_numeric(sub1['W1_A11'])
sub1['PPNET']= pandas.to_numeric(sub1['PPNET'])

# Recode values for W1_A11 into a new variable, WATCH_NAT_NEWS
recode1 = {1: 0, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, -1: 0}
sub1['WATCH_NAT_NEWS']= sub1['W1_A11'].map(recode1)
sub1['HAS_INTERNET']=sub1['PPNET']

# Output the number of people in each category
grouped = sub1.groupby('WATCH_NAT_NEWS')
print (grouped.count())

# Create contingency table of observed counts
contingency_table =pandas.crosstab(sub1['WATCH_NAT_NEWS'], sub1['HAS_INTERNET'])
print (contingency_table)

# Calculate column percentages:
# percentage of those who have internet that do/don't watch national news and
# percentage of those who do not have internet that do/don't watch national news
col_totals =contingency_table.sum(axis=0)
percentages =contingency_table/col_totals
print(percentages)

# Calculate chi-square
print ('chi-square value, p value, expected counts')
chi_square = scipy.chi2_contingency(contingency_table)
print (chi_square)

# Set variable types
sub1["HAS_INTERNET"] = sub1["HAS_INTERNET"].astype('category')
sub1['WATCH_NAT_NEWS'] = pandas.to_numeric(sub1['WATCH_NAT_NEWS'], errors='coerce')

# Graph the proportion watching national news within each internet access group
seaborn.factorplot(x="HAS_INTERNET", y="WATCH_NAT_NEWS", data=sub1, kind="bar", ci=None)
plt.xlabel('Has Internet')
plt.ylabel('Proportion Watch National News')
----------------------------------------
OUTPUT
----------------------------------------
2294
436

            W1_A11        PPNET
count  2294.000000  2294.000000
mean      4.190061     0.778989
std       2.625410     0.415019
min      -1.000000     0.000000
25%       2.000000     1.000000
50%       4.000000     1.000000
75%       7.000000     1.000000
max       8.000000     1.000000

                W1_A11  PPNET  HAS_INTERNET
WATCH_NAT_NEWS
0                  527    527           527
1                 1767   1767          1767

HAS_INTERNET      0     1
WATCH_NAT_NEWS
0               146   381
1               361  1406

HAS_INTERNET           0         1
WATCH_NAT_NEWS
0               0.287968  0.213206
1               0.712032  0.786794

chi-square value, p value, expected counts
(12.056067718516204, 0.0005162407168329077, 1,
 array([[ 116.47297297,  410.52702703],
        [ 390.52702703, 1376.47297297]]))

Text(6.8,0.5,'Proportion Watch National News')
[Image: bar chart of proportion watching national news by internet access]
oonaghmaryobrien-blog · 7 years ago
Data Mgmt and Visualisation Assignment 2
Program :-
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Thu Jun 21 11:22:34 2018

@author: oonagh.obrien
"""
import pandas
import numpy

data = pandas.read_csv('ool_pds.csv', low_memory=False)

print()
print()
print('The number of observations in the data is ')
print(len(data))
print('The number of variables in the data is ')
print (len(data.columns))

sub1 = data[['PPNET','W1_A11','W1_D9']]

#print(data.describe())
#print(data['PPNET'].describe())

# Convert the data for the variables W1_D9 - opinion of Hillary Clinton,
# PPNET - access to internet and
# W1_A11 - number of times watch news per week
data['PPNET']= pandas.to_numeric(data['PPNET'])
data['W1_A11']= pandas.to_numeric(data['W1_A11'])
data['W1_D9']= pandas.to_numeric(data['W1_D9'])

# Find the percentage of observations that indicate internet access
ct1 = data.groupby('PPNET').size()
pt1 = data.groupby('PPNET').size()*100/len(data)

c1 = data['PPNET'].value_counts(sort=False, dropna=False)
print('\n\nNumber of respondents with internet access ')
print ('0 indicates no access - 1 indicates access')
print(c1)

# Print blank lines
print()
print()
print('Percentage of respondents that have internet access ')
print ('0 indicates no access - 1 indicates access')
print(pt1)

# Number of times respondent watched national news on internet or TV last week
# 1 indicates 0 times, 2 indicates 1 time, ... 8 indicates 7 times
# -1 indicates no response
c2 = data['W1_A11'].value_counts(dropna=False)
print('\n\n Number of times respondents watched news on internet or TV last week')
print(' 1 indicates 0 times, 2 indicates 1 time, ... 8 indicates 7 times')
print(' -1 indicates no response')
print(c2)

# Popularity of Hillary Clinton with respondent indicated by percentage
# -1 indicates no response
print ('\n\nPopularity of Hillary Clinton with respondent indicated by percentage')
print ('Number of respondents that gave each percentage in second column\n')
print (' -1 indicates no response')
c3 = data['W1_D9'].value_counts(dropna=False)
print('rate Hillary Clinton ')
print(c3)
Output :
The number of observations in the data is
2294
The number of variables in the data is
436
Number of respondents with internet access
0 indicates no access - 1 indicates access
0     507
1    1787
Name: PPNET, dtype: int64
Percentage of respondents that have internet access
0 indicates no access - 1 indicates access
PPNET
0    22.101133
1    77.898867
dtype: float64
Number of times respondents watched news on internet or TV last week
1 indicates 0 times, 2 indicates 1 time, ... 8 indicates 7 times
-1 indicates no response
 1    516
 8    463
 3    284
 2    264
 4    245
 6    242
 5    146
 7    123
-1     11
Name: W1_A11, dtype: int64
Popularity of Hillary Clinton with respondent indicated by percentage
Number of respondents that gave each percentage in second column
-1 indicates no response rate hillary clinton 85     456 100    343 70     314 60     194 50     188 0      108 15      94 40      94 30      78 90      63 998     62 80      51 -1       51 75      48 95      27 65      21 55      11 10       8 5        7 20       7 25       7 35       5 99       4 45       4 62       3 94       3 2        3 87       3 69       3 98       3 79       2 72       2 4        2 89       2 59       2 3        2 74       1 57       1 29       1 88       1 6        1 93       1 12       1 91       1 22       1 36       1 76       1 38       1 37       1 96       1 58       1 97       1 7        1 68       1 92       1 Name: W1_D9, dtype: int64
The output gives the distribution of three variables in the input data.
The first is internet access: the number of respondents who do and do not have internet access. As there are as many values in the distribution as there are observations, all respondents have a value for this field.
The second is the number of times national news was watched on TV or the internet during the week; the first column gives the number of times per week and the second column the number of respondents. Missing responses are coded -1.
The third distribution gives the popularity of Hillary Clinton: the first column is the percentage rating given and the second column is the number of respondents who gave that rating.
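The counts and percentages in these distributions come from value_counts; a minimal sketch on made-up responses (toy values, not the survey data):

```python
# Minimal sketch: counts and percentages of a 0/1 variable with
# value_counts (toy responses, not the survey data).
import pandas

ppnet = pandas.Series([1, 1, 0, 1, 0, 1, 1, 1])  # 1 = has internet access

counts = ppnet.value_counts(sort=False, dropna=False)
percentages = ppnet.value_counts(normalize=True) * 100

print(counts)
print(percentages)
```

normalize=True returns proportions, so multiplying by 100 gives the percentage breakdown shown in the output.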
oonaghmaryobrien-blog · 7 years ago
Data Analysis Tools Assignment 1
Model Interpretation for ANOVA:
When examining the Outlook on Life 2012 data set for an association between the number of days a national news program was watched on TV or the internet (categorical explanatory variable) and opinion of Hillary Clinton (quantitative response variable), an Analysis of Variance (ANOVA) revealed a statistically significant association between the number of times per week a person watched the national news and their opinion of Hillary Clinton. See results below.
The model interpretation for the post hoc ANOVA results (Tukey HSD) indicated that people who watched national news 4, 5 or 7 times a week reported a significantly more positive opinion of Hillary Clinton than those who watched the national news 0 or 1 times per week; all other comparisons were statistically similar. See code and results below.
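A minimal sketch of this ANOVA-plus-Tukey workflow, run on small synthetic groups rather than the survey variables:

```python
# Minimal sketch: one-way ANOVA via statsmodels ols, then a Tukey HSD
# post hoc test, using synthetic groups instead of W1_A11 / W1_D9.
import pandas
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

df = pandas.DataFrame({
    'group':  [0]*5 + [1]*5 + [2]*5,
    'rating': [50, 52, 49, 51, 50,      # group 0
               55, 57, 54, 56, 55,      # group 1
               70, 72, 69, 71, 70],     # group 2
})

# ANOVA: do the group means differ?
model = smf.ols('rating ~ C(group)', data=df).fit()
print(model.f_pvalue)

# Tukey HSD: which pairs of groups differ?
mc = multi.MultiComparison(df['rating'], df['group'])
print(mc.tukeyhsd().summary())
```

The ANOVA only says that at least one group mean differs; the Tukey test is what identifies the specific pairs, which is why it is needed here with more than two groups.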
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Mon Jun 25 08:54:30 2018

@author: oonagh.obrien
"""
#Import libraries for doing data manipulation and statistical functions
import pandas
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
# Read in outlook on life dataset 2012 to data variable
# Check how many rows and columns in data
data = pandas.read_csv('ool_pds.csv', low_memory=False)
print(len(data))
print (len(data.columns))

# Create a new view of the data frame with nulls removed, including
# the response variable W1_D9 - opinion of Hillary Clinton - and
# W1_A11 - number of times per week national news watched on tv or internet
# (categorical explanatory variable)
# Reference this view of data with sub1
sub1 = data[['W1_A11','W1_D9']].dropna()
print (sub1)

# Convert data to numeric
sub1['W1_A11']= pandas.to_numeric(sub1['W1_A11'])
sub1['W1_D9']= pandas.to_numeric(sub1['W1_D9'])

# Output the number of people in each category of times per week
grouped = sub1.groupby('W1_A11')
print (grouped.count())

# Output the average opinion of Hillary Clinton of people
# in each category of times per week
meanOpinion = sub1.groupby('W1_A11').mean()
print(meanOpinion)

# Output the standard deviation in opinion of Hillary Clinton of people
# in each category of times per week
stdOpinion = sub1.groupby('W1_A11').std()
print(stdOpinion)

# Use the statistical function ols to calculate the ANOVA, with
# response variable W1_D9 - opinion of Hillary Clinton - and
# W1_A11 - number of times per week national news watched on tv or internet
# (categorical explanatory variable)
opinion_model = smf.ols(formula ='W1_D9 ~ C(W1_A11)', data=sub1).fit()
print (opinion_model.summary())

# Having got a p-value less than 0.05, find what the difference
# between means in the model means ...
# As the explanatory variable has more than two groups, determine
# which groups differ from the others:
# perform a post hoc test, and to avoid family-wise type 1 error
# use Tukey's honestly significant difference test
mc = multi.MultiComparison(sub1['W1_D9'], sub1['W1_A11'])
res1 = mc.tukeyhsd()
print(res1.summary())
[Images: OLS/ANOVA model summary, group means and standard deviations, and Tukey HSD post hoc results]
oonaghmaryobrien-blog · 7 years ago
Data Mgmt and Visualisation Assignment 1
I have selected the Outlook on Life Dataset 2012, produced by Belinda Robnett and Katherine Tate.
My research aims to determine, first, whether there is a negative link between having internet access and the popularity of Hillary Clinton, and second, whether watching news on TV or the internet is an indicator of the popularity of Hillary Clinton.
1. Hypothesis: internet access is a predictor of popularity for Hillary Clinton
2. Hypothesis: watching news on TV or the internet is a predictor of popularity for Hillary Clinton
A search of academic articles using the search terms ‘Hillary Clinton Voter News Source’ generated 4180 results. 
According to Boxell and Gentzkow (2017), actual data gathered from Trump voters after the 2016 presidential election did not indicate that 'internet media and online campaign methods conferred an advantage to Trump compared to other Republican presidential candidates in the internet era', as is often suggested in discussion of Cambridge Analytica and Russian influence.
They found that ‘Relative to prior years, the Republican share of the vote in 2016 was as high or higher among the groups least active online.’ 
Other research indicates that Russian intervention and Republicans' success in "marrying content with delivery and data" online (Johnson 2017) may have influenced the election. Others have emphasized that the Trump campaign's use of data to target messages online was successful (Confessore and Hakim 2017).
Generally it is suggested that the public's opinion of Hillary Clinton was influenced negatively on social media. This research will investigate whether having access to the internet in 2012 was an indicator of popularity for Hillary Clinton.
In the research I will use the categorical variable on whether a person had internet access or not (PPNET - Internet Access) and the quantitative variable on participants' opinion of Hillary Clinton (W1_D9, 'How would you rate Hillary Clinton') to see if there is a link between internet access and opinion of Hillary Clinton, supporting my first hypothesis, 'Internet access is an indicator of popularity for Hillary Clinton'. The unique identifier case-id will be used to distinguish each participant in the survey. The variable W1_A11, 'how many days did you watch national news programs on television or internet', may be used in further research on the second hypothesis, to look for a link between news source and opinion of Hillary Clinton (W1_D9). The Pew Research Center found that the main source of news for Trump voters was Fox News.
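A first look at the hypothesis could compare mean W1_D9 ratings between the two PPNET groups. The sketch below uses a tiny invented stand-in dataset (not the real ool_pds.csv) purely to show the shape of that comparison: group means by internet access, then a two-sample t-test as a simple first check.

```python
import pandas as pd
import scipy.stats as stats

# Hypothetical stand-in for ool_pds.csv: PPNET (1 = has internet access)
# and W1_D9 (0-100 rating of Hillary Clinton); values are invented
data = pd.DataFrame({
    'PPNET': [1, 1, 1, 0, 0, 1, 0, 1, 0, 0],
    'W1_D9': [60, 55, 40, 80, 75, 50, 85, 45, 70, 90],
})

# Mean rating per group: is it lower among those with internet access?
print(data.groupby('PPNET')['W1_D9'].mean())  # 0 -> 80.0, 1 -> 50.0

# Two-sample t-test (equivalent to a one-way ANOVA with two groups)
with_net = data.loc[data['PPNET'] == 1, 'W1_D9']
without_net = data.loc[data['PPNET'] == 0, 'W1_D9']
t, p = stats.ttest_ind(with_net, without_net)
print(f't = {t:.2f}, p = {p:.4f}')
```

On the real survey data the same pattern applies: group W1_D9 by PPNET, inspect the means, then test whether the difference is statistically significant.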
Confessore, Nicholas and Danny Hakim. 2017. Data firm says ‘secret sauce’ aided Trump; Many scoff. New York Times. Available at https://www.nytimes.com/2017/03/06/us/politics/cambridge-analytica.html. Accessed September 14, 2017. 
Hampton, Keith N. and Eszter Hargittai. 2016. Stop blaming Facebook for Trump’s election win. The Hill. Available at http://thehill.com/blogs/pundits-blog/presidential-campaign/307438-stop-blaming-facebook-for-trumps-election-win. Accessed June 14, 2017.
Johnson, Eric. 2017. Full transcript: Hillary Clinton at Code 2017. recode.net. Available at https://www.recode.net/2017/5/31/15722218/hillary-clinton-code-conference-transcript-donald-trump-2016-russia-walt-mossberg-kara-swisher. Accessed September 21, 2017.
Boxell, Levi, Matthew Gentzkow, and Jesse M. Shapiro. 2017. A Note on Internet Use and the 2016 Election Outcome. Brown University and NBER, September 2017.
Pew Research Center. 2017. Trump, Clinton Voters Divided in Their Main Source for Election News. Journalism & Media, Pew Research Center, January 18, 2017.