nnikesh - Tumblr blog

nnikesh · 5 years ago

Coursera Week 4: Data Analysis and Visualization

In which we learn how to visualize the data we have managed! Such fun!

Explore the data using the describe function. As shown below, countries in the Gapminder data set have a mean combined communication score of 24.96 per 100, with a standard deviation of 33.9. Some countries show a combined communication score > 100 because the same people have access to both cell and Internet communication.
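A minimal sketch of the describe step, assuming a pandas DataFrame with the combined score stored in a column named COM_COMBO_200 (the name used in the Week 3 post). The five values are made up for illustration and are not the Gapminder figures.

```python
import pandas

# Hypothetical sample values, NOT the real Gapminder figures
data = pandas.DataFrame({"COM_COMBO_200": [0.0, 0.0, 12.5, 60.0, 110.0]})

# describe() reports count, mean, std, min, quartiles and max in one call
summary = data["COM_COMBO_200"].describe()
print(summary)
```

The mean and std entries of the result are the two numbers quoted in the paragraph above.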

————————————————————————

The histogram of the combined Communication score below shows a right-skewed distribution: even when access to cell and Internet communication are added together, 110 of 194 countries have combined (reported) communication scores of zero, and approximately 40 of the remaining 84 countries have combined communication scores of less than 60 per 100.

————————————————————————

Describe the HDI indicator; input and output are shown below. The mean score is 0.6 on a scale of 0-1, with a standard deviation of 0.18. This shows variance, but not as much variance as in the Combined Communication score analyzed above. The range is 0.56.

The Human Development Index, below, has a less obvious shape, although perhaps you could say it is bimodal at 0.4/1 and 0.6/1. More generally, about 100 of 153 countries with information on this indicator in the Gapminder Data show a Human Development Index greater than 0.5. The countries with no data are generally the worst off, however, and about 40 countries have missing data on HDI.

————————————————————————

Connecting more than one variable

I chose communication as the independent variable.

The communication score (cell access per 100 + Internet access per 100) is the explanatory variable; HDI is the response (dependent) variable.

The explanatory variable is placed on the X axis and the dependent variable on the Y axis.

This is the basic scatterplot that results.

————————————————————————

In the second scatterplot, below, to which I have added a line of best fit, there seems to be a strong positive relationship between Communication and HDI, although the best fit line would be better if it were more of a curve than a straight line.
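One way to check the impression that a curve would fit better than a straight line is to compare the squared error of a degree-1 and a degree-2 polynomial fit. This is only a sketch with made-up (communication score, HDI) pairs that level off at high scores; it uses numpy.polyfit rather than whatever plotting library drew the original line.

```python
import numpy

# Hypothetical (communication score, HDI) pairs, NOT Gapminder data
x = numpy.array([0.0, 10.0, 20.0, 40.0, 80.0, 120.0])
y = numpy.array([0.30, 0.45, 0.55, 0.68, 0.80, 0.85])

# Degree 1 = straight line of best fit; degree 2 = simple curve
line = numpy.polyfit(x, y, 1)
curve = numpy.polyfit(x, y, 2)


def sum_sq_error(coeffs):
    # Sum of squared residuals between the fitted polynomial and the data
    return float(numpy.sum((numpy.polyval(coeffs, x) - y) ** 2))


print("straight line error:", sum_sq_error(line))
print("quadratic error:", sum_sq_error(curve))
```

A clearly smaller error for the quadratic fit would support using a curve; on real data the difference may be small.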

————————————————————————

Next step: group the explanatory variable (x). 194 countries’ data on cell phone and Internet access are split into four bins of between 48 and 49 countries each, as shown by input code and output below.
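A minimal sketch of the binning step with pandas.qcut, assuming the score lives in a column named COM_COMBO_200. The column name COM_QUARTILE and the eight sample values are hypothetical and only for illustration.

```python
import pandas

# Hypothetical scores for eight countries (illustration only)
data = pandas.DataFrame(
    {"COM_COMBO_200": [0.5, 1.2, 3.4, 8.0, 15.5, 30.1, 75.0, 110.2]}
)

# qcut splits on quantiles, so asking for 4 groups gives roughly equal-sized bins
data["COM_QUARTILE"] = pandas.qcut(
    data["COM_COMBO_200"], 4, labels=["Q1", "Q2", "Q3", "Q4"]
)

# Count how many countries fall in each bin
counts = data["COM_QUARTILE"].value_counts(sort=False)
print(counts)
```

With 194 countries the same call would give bins of 48 or 49 countries, which matches the split described above.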

————————————————————————

Finally, create the categorical-to-quantitative bar chart. Here the Communication Score is divided into quartiles and displayed on the X axis; the Y axis shows the Human Development Index. There seems to be a more linear relationship between the two here than there was in the scatterplot representation of the relationship.
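The bar heights in a chart like this are the mean HDI within each communication quartile. A sketch of that calculation, assuming columns named COM_COMBO_200 and HDI_2000; the column COM_QUARTILE and all values are hypothetical.

```python
import pandas

# Hypothetical values for eight countries (illustration only)
data = pandas.DataFrame({
    "COM_COMBO_200": [0.5, 1.2, 3.4, 8.0, 15.5, 30.1, 75.0, 110.2],
    "HDI_2000": [0.30, 0.35, 0.42, 0.50, 0.61, 0.70, 0.82, 0.90],
})

# Bin the explanatory variable into quartiles
data["COM_QUARTILE"] = pandas.qcut(
    data["COM_COMBO_200"], 4, labels=["Q1", "Q2", "Q3", "Q4"]
)

# Mean HDI per quartile: one bar per quartile at these heights
bar_heights = data.groupby("COM_QUARTILE", observed=False)["HDI_2000"].mean()
print(bar_heights)
```

Averaging within bins smooths out the spread inside each quartile, which is one reason the bar chart can look more linear than the scatterplot.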

Here is a link to all the code in my program

nnikesh · 5 years ago

week 3

Coursera: Data Management and Visualization, Week 3.

Continue with the program you’ve successfully run.

I had difficulty with the programming piece last week, so I spent some time reviewing week 1 and finding my errors. Success! I got last week’s program to run (with different variables, see below). I was also frustrated with the limited data provided by the Gapminder sample for this course, because it didn’t allow me to ask the questions that I wanted to ask about the connection between access to communication and HDI (Human Development Index), so I went to Gapminder.org and selected two additional variables: HDI and cell phone use per 100.

Decide how you will manage your variables.

STEP 1: Make and implement data management decisions for the variables you selected.

Data management includes such things as coding out missing data, coding in valid data, recoding variables, creating secondary variables and binning or grouping variables. Not everyone does all of these, but some is required.

I had to do some preliminary data management when I copied the data from Gapminder.org to the Coursera Gapminder data set, because there were 60 more rows of data in the information I imported: I discovered that the Coursera set had already coded out 60 countries with no information (mostly countries that no longer exist because of border changes). I removed those rows from the new data so that it matched up with the existing data.

Then I grouped two variables: Internet use per 100 and Cell use per 100. My reasoning was that access to either or both is what I want to correlate with Human Development Index. I used the simple formula:

data['COM_COMBO_200'] = data['INTERNETPER100'] + data['CELLPER100']
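A small sketch of how the formula behaves, assuming the two columns may have been read in as text. The rows are hypothetical, and "unknown" stands in for a missing entry.

```python
import pandas

# Hypothetical rows (illustration only); "unknown" stands in for missing data
data = pandas.DataFrame({
    "INTERNETPER100": ["10.5", "43.8", "unknown"],
    "CELLPER100": ["20.0", "66.3", "5.0"],
})

# Coerce text to numbers; anything that cannot be parsed becomes NaN
for col in ["INTERNETPER100", "CELLPER100"]:
    data[col] = pandas.to_numeric(data[col], errors="coerce")

# Sum of two per-100 rates, so the possible range is 0-200
data["COM_COMBO_200"] = data["INTERNETPER100"] + data["CELLPER100"]
print(data)
```

Note that a country missing either rate ends up with a missing combined score, because NaN plus a number is NaN.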

STEP 2: Run frequency distributions for your chosen variables and select columns, and possibly rows.

The challenge of the Gapminder dataset is that it is sequential (continuous) rather than categorical: long runs of output values in no order and with almost no repetition. My goal in this section was to divide the output data into quartiles and then analyze frequency (how often the “Combined Communication Score” fell into the first, second, third and fourth quartile, for example). I got stuck here because I could not figure out how to use the pandas qcut function to divide the data into quartiles. I reviewed the sample code and looked up a bunch of help files, but was not able to run the qcut binning code successfully. I’ll keep working on it this week.

I am pasting the input and output code below with subtitles. NOTE: The output code is long because it is listing all the countries in the dataset. The frequency tables are not yet completed because of the bugs I have not yet resolved with the pandas qcut function. I welcome any feedback and help with that part of my code, so I will post it here first:

# quartile split (use qcut function & ask for 4 groups - gives you quartile split)
print("Combined Communication Access, 4 categorieS,quartiles")
sub4'[HDI_PCT'] = pandas.qcut(sub4.HDI_PCT, 4, labels=["1=0%tile","2=25%tile","3=50%tile","4=75%tile"])
sub4 = ['HDI_PCT4'].value_counts(sort=False, dropna=True)
print(sub4)

HERE IS THE ERROR CODE IN THE OUTPUT:

[In 51] runfile('/Users/teacher/Dropbox/Coursera/ANALYZING DATA/PYTHON_DATA_FILES/coursera_gapminder_2.py', wdir='/Users/teacher/Dropbox/Coursera/ANALYZING DATA/PYTHON_DATA_FILES')
  File "/Users/teacher/Dropbox/Coursera/ANALYZING DATA/PYTHON_DATA_FILES/coursera_gapminder_2.py", line 77
    'HDI_PCT4' = pandas.qcut('HDI_2000', 4, labels=["1=0%tile","2=25%tile","3=50%tile","4=75%tile"])
    ^
SyntaxError: can't assign to literal
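The SyntaxError comes from the left-hand side of the assignment on line 77: it is the bare string 'HDI_PCT4', and Python can only assign to a name or to a column reference such as sub4['HDI_PCT4']. A minimal sketch of a version that runs, assuming sub4 is a DataFrame with an HDI_2000 column; the eight values are a small sample for illustration, not the full data set.

```python
import pandas

# Small illustrative sample; the real sub4 would hold every country
sub4 = pandas.DataFrame(
    {"HDI_2000": [0.224, 0.343, 0.488, 0.585, 0.665, 0.749, 0.802, 0.913]}
)

# Assign to a column of the DataFrame, not to a string literal, and pass
# qcut the column itself (sub4["HDI_2000"]) rather than the column's name.
sub4["HDI_PCT4"] = pandas.qcut(
    sub4["HDI_2000"], 4,
    labels=["1=0%tile", "2=25%tile", "3=50%tile", "4=75%tile"],
)

# Use a new name for the counts so the DataFrame sub4 is not overwritten
c4 = sub4["HDI_PCT4"].value_counts(sort=False, dropna=True)
print(c4)
```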

————————————

HERE IS THE REST OF THE INPUT CODE–THIS RAN SUCCESSFULLY, INCLUDING THE GROUPING PART.

# -*- coding: utf-8 -*-
"""
Created on Wed Oct 7 22:33:59 2015

@author: teacher
"""

# import helper libraries
import pandas
import numpy

# import the entire data set to memory
data = pandas.read_csv('gapminder.csv', low_memory=False)

# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%f' % x)

# tell the program to report how many rows and columns are in dataset
print(len(data))          # number of observations (rows)
print(len(data.columns))  # number of variables (columns)

# another option for displaying observations or rows in a dataframe is
# print(len(data.index))

# ensure this column is numeric (pandas.to_numeric replaces convert_objects,
# which was removed from later pandas releases)
data['CELLPER100'] = pandas.to_numeric(data['CELLPER100'], errors='coerce')

# set data counts (how many instances of ...)
print("rates for cell user rate per 100 by country")
c1 = data['CELLPER100'].value_counts(sort=False)
print(c1)

# set percentages (what percentage of counts found above are ...)
# print("percentage rate for internet user rate by country")
# p1 = data['internetuserate'].value_counts(sort=False, normalize=True)
# print(p1)

# set data counts (how many instances of ...)
print("counts for HDI by country")
c2 = data['HDI_2000'].value_counts(sort=False)
print(c2)

# set percentages (what percentage of counts found above are ...)
# print("percentage rate for income per person by country")
# p2 = data['HDI_2000'].value_counts(sort=False, normalize=True)
# print(p2)

# set data counts (how many instances of ...)
print("internet Use per 100")
c3 = data['INTERNETPER100'].value_counts(sort=False)
print(c3)

# RECODING VARIABLES: Well, one thing is that the communications access is a %,
# but the HDI index is a fraction of 1. So if I multiply the HDI by 100,
# I will get comparable data.
print("HDI expressed as a percent")
data['HDI_PCT'] = data['HDI_2000'] * 100
p4 = data['HDI_PCT'].value_counts(sort=False)
print(p4)

# GROUPING VARIABLES: Make a "communications index" by adding together cell
# plus internet (not divided by 2, so the scale is 0-200)
print("Combined Communication Access Per 100 people")
data['COM_COMBO_200'] = data['INTERNETPER100'] + data['CELLPER100']
p5 = data['COM_COMBO_200'].value_counts(sort=False)
print(p5)
# some of the returned data will be > 100 because in developed countries
# a lot of people have access to both cell and internet communications

——————————————————————–

HERE IS THE OUTPUT, EXCLUDING THE ERROR MESSAGE FROM THE QUARTILES CUT FUNCTION (pandas qcut), which I pasted above.

runfile('/Users/teacher/Dropbox/Coursera/ANALYZING DATA/PYTHON_DATA_FILES/coursera_gapminder_2.py', wdir='/Users/teacher/Dropbox/Coursera/ANALYZING DATA/PYTHON_DATA_FILES')

Number of rows and columns in dataset
213
19

Rates for cell user rate per 100 by country

Note: the first row shows the number of countries in the dataset with zero known cell users (10). The data is sequential: the second row is a single country with about 0.41 cell users per 100, and because no two countries report exactly the same rate above zero, the frequency is 1 for all other countries.

0.000000 10
0.407645 1
2.232304 1
3.906133 1
4.235535 1
5.183262 1
6.711056 1
7.013394 1
8.210355 1
9.217905 1
10.590927 1
0.764645 1
12.639333 1
13.294034 1
14.213866 1
0.436380 1
16.616632 1
17.567900 1
0.857225 1
1.719497 1
20.689436 1
21.873875 1
22.059434 1
23.144852 1
24.526146 1
25.355881 1
26.620226 1
2.188320 1
28.330071 1
29.048790 1
..
0.524150 1
3.384683 1
8.348677 1
0.166614 1
0.281603 1
1.099865 1
73.804929 1
2.795386 1
1.260190 1
0.031424 1
39.968564 1
2.295012 1
0.058860 1
0.678537 1
0.215291 1
0.281655 1
53.995323 1
58.312206 1
0.212073 1
4.325262 1
26.895153 1
1.591262 1
14.083571 1
53.121388 1
15.360794 1
55.318149 1
30.438621 1
2.603997 1
5.761212 1
28.456961 1
dtype: int64

Counts for HDI (Human Development Index) by country

Note: in this output table, we see several instances of countries with the same HDI, but because in general the counts are still one, this output of frequency is still not very useful. The data is still overwhelmingly sequential.

0.749000 3
0.585000 2
0.665000 1
0.569000 1
0.879000 1
0.646000 1
0.488000 1
0.657000 2
0.802000 1
0.854000 1
0.754000 1
0.343000 1
0.576000 1
0.718000 2
0.774000 1
0.726000 1
0.778000 1
0.633000 1
0.669000 1
0.764000 1
0.527000 1
0.715000 1
0.378000 1
0.443000 1
0.703000 1
0.612000 2
0.856000 1
0.680000 1
0.586000 1
0.636000 1
..
0.752000 1
0.404000 1
0.421000 1
0.602000 1
0.619000 1
0.224000 1
0.422000 1
0.800000 1
0.732000 1
0.913000 1
0.824000 1
0.398000 1
0.818000 1
0.252000 1
0.801000 1
0.736000 1
0.833000 1
0.588000 1
0.372000 2
0.674000 1
0.773000 1
0.816000 1
0.846000 1
0.275000 1
0.882000 1
0.779000 1
0.451000 1
0.357000 1
0.704000 1
0.753000 1
dtype: int64

Internet Use per 100

Note: in this output table we have no frequencies >1 of a specific percent of Internet users per country, but we have detailed info on rates in each country. I want to be able to divide this output information into four groups and analyze frequency per quartile. For example, how many (=frequency) countries have 0-25% Internet use? From that, we could derive data about the overall penetration of Internet use in the world.

0.252120 1
1.300470 1
2.870685 1
3.973678 1
0.934190 1
5.980834 1
6.482226 1
7.038683 1
8.000000 1
0.000000 1
10.538836 1
0.036261 1
4.863679 1
13.870277 1
15.442823 1
16.600000 1
17.843928 1
0.230103 1
19.964202 1
0.401434 1
21.384731 1
16.113127 1
23.128822 1
0.111044 1
13.856852 1
28.321759 1
29.214739 1
30.266891 1
31.745552 1
0.071039 1
..
2.719854 1
0.020000 1
0.231462 1
1.983373 1
1.534272 1
1.785225 1
3.573255 1
13.665851 1
0.053394 1
0.520674 1
0.491706 1
0.747631 1
3.076431 1
43.811215 1
0.017703 1
7.337695 1
13.633502 1
26.813376 1
1.053402 1
0.047023 1
0.400944 1
0.204652 1
29.718966 1
0.064081 1
3.245037 1
2.108337 1
6.357064 1
43.130141 1
10.484951 1
28.602703 1
dtype: int64

HDI expressed as a percent

Note: here I recoded the HDI data as out of 100. It turned out not to be useful.

37.100000 1
80.500000 1
57.900000 1
87.900000 1
83.300000 1
77.600000 1
86.100000 1
22.400000 1
23.000000 1
24.500000 2
25.200000 1
62.600000 1
27.400000 1
28.600000 1
30.600000 2
31.300000 1
63.400000 1
34.300000 1
35.700000 1
36.000000 1
37.400000 2
38.400000 1
39.800000 1
42.300000 1
41.000000 1
42.200000 1
43.600000 1
62.100000 1
45.100000 1
46.100000 1
..
77.300000 1
40.400000 1
70.500000 1
87.800000 1
58.600000 1
44.800000 1
61.600000 1
22.900000 1
86.400000 1
52.800000 1
77.000000 1
42.100000 1
81.800000 1
70.100000 1
87.300000 1
57.700000 2
37.200000 2
52.300000 1
89.700000 1
83.900000 2
58.800000 1
82.400000 1
80.200000 1
58.300000 1
77.900000 1
27.500000 1
40.800000 1
85.600000 1
61.900000 1
86.900000 1
dtype: int64

Combined Communication Access Per 100 people

Note: after exploring how the Human Development Index was created, I decided to create a simple Combined Communication Index. Separately, these rates are per 100 people, and I wondered if having a few combined communication scores of >100 was meaningless, but I decided it was not. Either form of communication (cell or Internet) might be correlated with HDI, and populations that have access to BOTH seem particularly advantaged in their flexible access to information and communication, so having access to both DOES seem meaningful. I had originally thought I would divide the combined score by 2 so it would match the HDI measurement scale, but decided that setting the range as 0-200 was more meaningful.

0.773361 1
1.083902 1
2.360911 1
3.611213 1
0.000000 1
5.332160 1
6.334632 1
7.882997 1
8.456157 1
9.074558 1
10.697077 1
11.199699 1
12.691890 1
13.695744 1
14.615110 1
15.542000 1
16.164720 1
17.320330 1
18.590927 1
1.076393 1
20.437865 1
21.399280 1
22.387523 1
23.147308 1
24.606583 1
25.314575 1
5.241640 1
27.806690 1
3.700917 1
29.691653 1
..
8.337351 1
1.404448 1
0.136712 1
23.176002 1
8.827646 1
81.833557 1
0.088335 1
73.861818 1
37.421119 1
81.089392 1
24.871825 1
1.093191 1
29.117566 1
2.930326 1
102.580042 1
0.866948 1
1.160070 1
0.282424 1
1.982430 1
81.883932 1
0.701244 1
110.118766 1
82.542727 1
10.330930 1
0.286330 1
4.638517 1
1.101222 1
38.045074 1
5.970001 1
10.930585 1
dtype: int64

Review Criteria

Your assessment will be based on the evidence you provide that you have completed all of the steps. When relevant, gradients in the scoring will be available to reward clarity (for example, you will get one point for submitting output that is not understandable, but two points if it is understandable). In all cases, consider that the peer assessing your work is likely not an expert in the field you are analyzing. You will be assessed equally on your description of your frequency distributions.

Specific rubric items, and their point values, are as follows:

Was the program output interpretable (i.e. organized and labeled)? (1 point)

Does the program output display three data managed variables as frequency tables? (1 point)

Did the summary describe the frequency distributions in terms of the values the variables take, how often they take them, the presence of missing data, etc.? (2 points)

nnikesh · 5 years ago

week 2

Week 2

Create a blog entry where you post 1) your program 2) the output that displays three of your variables as frequency tables and 3) a few sentences describing your frequency distributions in terms of the values the variables take, how often they take them, the presence of missing data, etc.

Overview of this week: I got started with Python and am really excited about what I am learning. However, I ran into a BIG obstacle when I discovered that the dataset I had chosen –Gapminder– was composed of information that did not have “yes” or “no” answers in it. Anywhere. Instead, it reports rates (say, of internet usage), and ranges (say, of per capita income). So when I tried to apply the sample lesson to my data, there was no clear way to identify “frequency” of instances of data. After searching around on the Python help sites and the Pandas web sites, I did discover that there is syntax one can develop to explore ranges using subroutines, and I tried this out: see lines 61-71 for example:

pt1 = data.groupby('incomeperperson').size() * 100 / len(data)

sub1 = data[(data['lifeexpectancy'] >= 42) & (data['lifeexpectancy'] <= 38.5)] # & (data['internetuserate'])]
print("life expectancy by quartile")

However, I am pretty inexperienced at writing code, and when I try to run the sections of code related to this I don’t get meaningful results. I have run out of time to work on resolving this this week. Alas!
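One likely reason for the empty result is that no value can be both >= 42 and <= 38.5, so the filter above never matches a row. A sketch of an alternative, assuming the 42-84 year range and the bin width of 10.5 noted in the program below: pandas.cut places each country in one of four equal-width bins (not true quartiles). The column name LIFE_BIN and the sample values are hypothetical.

```python
import pandas

# Hypothetical life expectancies; None stands in for a country with no data
data = pandas.DataFrame(
    {"lifeexpectancy": [48.2, 55.4, 61.1, 68.8, 72.6, 74.8, 79.9, 81.5, None]}
)

# Four equal-width bins across the 42-84 range (width 10.5)
edges = [42, 52.5, 63, 73.5, 84]
data["LIFE_BIN"] = pandas.cut(
    data["lifeexpectancy"], bins=edges, include_lowest=True
)

# dropna=False keeps a row for countries with no life expectancy value
counts = data["LIFE_BIN"].value_counts(sort=False, dropna=False)
print(counts)
```

For bins holding equal numbers of countries rather than equal spans of years, pandas.qcut would be the function to use instead.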

Still, here is my turn in:

1) My program:

"""
Created on Wed Oct 7 22:33:59 2015

@author: teacher
"""

# import helper libraries
import pandas
import numpy

# import the entire data set to memory
data = pandas.read_csv('gapminder.csv', low_memory=False)

# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%f' % x)

# tell the program to report how many rows and columns are in dataset
print(len(data))          # number of observations (rows)
print(len(data.columns))  # number of variables (columns)

# another option for displaying observations or rows in a dataframe is
print(len(data.index))

# ensure each of these columns is numeric (pandas.to_numeric replaces
# convert_objects, which was removed from later pandas releases)
data['internetuserate'] = pandas.to_numeric(data['internetuserate'], errors='coerce')
data['incomeperperson'] = pandas.to_numeric(data['incomeperperson'], errors='coerce')
data['lifeexpectancy'] = pandas.to_numeric(data['lifeexpectancy'], errors='coerce')

# set data counts (how many instances of ...)
print("rates for internet user rate by country")
c1 = data['internetuserate'].value_counts(sort=False)
print(c1)

# set percentages (what percentage of counts found above are ...)
# print("percentage rate for internet user rate by country")
# p1 = data['internetuserate'].value_counts(sort=False, normalize=True)
# print(p1)

# set data counts (how many instances of ...)
print("counts for income per person by counry")
c2 = data['incomeperperson'].value_counts(sort=False)
print(c2)

# set percentages (what percentage of counts found above are ...)
# print("percentage rate for income per person by country")
# p2 = data['incomeperperson'].value_counts(sort=False, normalize=True)
# print(p2)

# set data counts (how many instances of ...)
print("counts of life expectancy by country")
c3 = data['lifeexpectancy'].value_counts(sort=False)
print(c3)

# set percentages (what percentage of counts found above are ...)
# print("percentage rate for life expectancy by country")
# p3 = data['lifeexpectancy'].value_counts(sort=False, normalize=True)
# print(p3)

# for this variable I sorted data to identify range: 42-84 years
# now I need to write code that establishes quartiles: 42 year difference
# between worst and best life exp. / 4 = 10.5
print("counts of life expectancy by quartile")
# NOTE: this groups by incomeperperson, not by life expectancy quartile
ct1 = data.groupby('incomeperperson').size()
print(ct1)
pt1 = data.groupby('incomeperperson').size() * 100 / len(data)

# NOTE: no value is both >= 42 and <= 38.5, so sub1 is always empty
sub1 = data[(data['lifeexpectancy'] >= 42) & (data['lifeexpectancy'] <= 38.5)]
# & (data['internetuserate'])]
print("life expectancy by quartile")

print("counts for life expectancy")
c3 = data['lifeexpectancy'].value_counts(sort=True, dropna=False, normalize=True)
print(c3)

2) the output that displays three of my variables as frequency tables

runfile('/Users/teacher/Dropbox/Coursera/ANALYZING DATA/coursera_gapminder.py', wdir='/Users/teacher/Dropbox/Coursera/ANALYZING DATA')
213
16
213

rates for internet user rate by country
0.720009 1
1.400061 1
2.100213 1
3.654122 1
4.999875 1
5.098265 1
6.497924 1
7.232224 1
8.959140 1
9.999954 1
1.259934 1
11.090765 1
12.645733 1
13.598876 1
14.830736 1
15.899982 1
90.703555 1
62.811900 1
0.829997 1
1.700031 1
20.001710 1
31.050013 1
16.780037 1
24.999946 1
25.899797 1
26.740025 1
2.699966 1
3.129962 1
29.999940 1
76.587538 1
..
77.996781 1
51.958038 1
42.692335 1
81.000000 1
12.006692 1
2.599974 1
81.590397 1
80.000000 1
36.422772 1
39.820178 1
13.000006 1
48.516818 1
28.289701 1
9.998554 1
53.024745 1
43.055067 1
61.987413 1
7.930096 1
26.477223 1
44.585355 1
2.199998 1
53.740217 1
29.879921 1
44.570074 1
40.020095 1
2.259976 1
6.965038 1
31.568098 1
20.663156 1
28.999477 1
dtype: int64

counts for income per person by counry
2668.020519 1
5634.003948 1
6147.779610 1
772.933345 1
26551.844238 1
1543.956457 1
13577.879885 1
115.305996 1
523.950151 1
33923.313868 1
1860.753895 1
5900.616944 1
20751.893424 1
786.700098 1
275.884287 1
276.200413 1
2231.993335 1
1784.071284 1
369.572954 1
9243.587053 1
285.224449 1
37662.751250 1
544.599477 1
37491.179523 1
180.083376 1
1525.780116 1
39972.352768 1
2062.125152 1
18982.269285 1
24496.048264 1
..
3545.652174 1
62682.147006 1
220.891248 1
952.827261 1
1810.230533 1
736.268054 1
8445.526689 1
9425.325870 1
1253.292015 1
27110.731591 1
25575.352623 1
744.239413 1
2025.282665 1
1258.762596 1
1232.794137 1
722.807559 1
5188.900935 1
32292.482984 1
495.734247 1
10480.817203 1
5528.363114 1
242.677534 1
2534.000380 1
16372.499781 1
2549.558474 1
760.262365 1
31993.200694 1
22275.751661 1
2557.433638 1
25249.986061 1
dtype: int64

counts of life expectancy by country
80.734000 1
49.025000 1
74.402000 1
74.825000 1
57.937000 1
76.546000 1
71.017000 1
73.703000 1
77.653000 1
73.737000 1
58.582000 1
74.847000 1
79.977000 1
73.126000 1
80.170000 1
75.850000 1
73.990000 1
62.465000 1
61.452000 1
73.911000 1
80.642000 1
76.640000 1
72.444000 1
72.640000 1
75.632000 1
68.823000 1
74.044000 1
79.634000 1
76.652000 1
61.061000 1
..
57.134000 1
72.832000 1
72.283000 1
81.539000 1
80.009000 1
49.553000 1
75.956000 1
48.196000 1
53.183000 1
74.522000 1
81.855000 1
68.498000 1
69.317000 1
81.126000 1
75.057000 1
79.341000 1
76.142000 1
80.557000 1
81.439000 1
68.287000 1
79.839000 1
55.377000 1
50.239000 1
59.318000 1
78.531000 1
74.126000 1
74.788000 1
76.128000 1
72.150000 1
56.081000 1
dtype: int64

counts of life expectancy by quartile
incomeperperson
103.775857 1
115.305996 1
131.796207 1
155.033231 1
161.317137 1
180.083376 1
184.141797 1
220.891248 1
239.518749 1
242.677534 1
268.259450 1
268.331790 1
269.892881 1
275.884287 1
276.200413 1
279.180453 1
285.224449 1
320.771890 1
336.368749 1
338.266391 1
354.599726 1
358.979540 1
369.572954 1
371.424198 1
372.728414 1
377.039699 1
377.421113 1
389.763634 1
411.501447 1
432.226337 1
..
20751.893424 1
21087.394125 1
21943.339898 1
22275.751661 1
22878.466567 1
24496.048264 1
25249.986061 1
25306.187193 1
25575.352623 1
26551.844238 1
26692.984107 1
27110.731591 1
27595.091347 1
28033.489283 1
30532.277044 1
31993.200694 1
32292.482984 1
32535.832512 1
33923.313868 1
33931.832079 1
33945.314422 1
35536.072471 1
37491.179523 1
37662.751250 1
39309.478859 1
39972.352768 1
52301.587179 1
62682.147006 1
81647.100031 1
105147.437697 1
dtype: int64

life expectancy by quartile
counts for life expectancy
nan 0.103286
72.974000 0.009390
73.979000 0.009390
67.714000 0.004695
78.826000 0.004695
74.156000 0.004695
67.185000 0.004695
83.394000 0.004695
66.618000 0.004695
82.759000 0.004695
73.127000 0.004695
80.414000 0.004695
79.915000 0.004695
73.396000 0.004695
68.944000 0.004695
65.193000 0.004695
75.246000 0.004695
74.576000 0.004695
73.488000 0.004695
72.196000 0.004695
71.172000 0.004695
70.739000 0.004695
69.042000 0.004695
76.835000 0.004695
81.012000 0.004695
55.442000 0.004695
64.666000 0.004695
77.005000 0.004695
81.404000 0.004695
68.846000 0.004695
...
74.641000 0.004695
54.675000 0.004695
70.349000 0.004695
79.158000 0.004695
80.854000 0.004695
68.494000 0.004695
74.941000 0.004695
67.852000 0.004695
79.499000 0.004695
75.901000 0.004695
76.072000 0.004695
72.317000 0.004695
51.444000 0.004695
77.685000 0.004695
74.515000 0.004695
64.986000 0.004695
68.749000 0.004695
67.529000 0.004695
69.927000 0.004695
51.219000 0.004695
57.062000 0.004695
62.791000 0.004695
54.210000 0.004695
74.241000 0.004695
79.311000 0.004695
76.126000 0.004695
58.199000 0.004695
75.181000 0.004695
78.371000 0.004695
75.620000 0.004695
dtype: float64
//anaconda/lib/python3.4/site-packages/pandas/core/ops.py:566: RuntimeWarning: invalid value encountered in greater_equal
  result = getattr(x, name)(y)
//anaconda/lib/python3.4/site-packages/pandas/core/ops.py:566: RuntimeWarning: invalid value encountered in less_equal
  result = getattr(x, name)(y)

3) a few sentences describing my frequency distributions in terms of the values the variables take, how often they take them, the presence of missing data, etc. I am not sure how to answer this question, since the Gapminder dataset does not lend itself to analyzing the frequency of particular values, but rather to analyzing the correlation between ranges of values, e.g. % of population with Internet access and per capita income. I understand from the class forum that this is true for everyone who used Gapminder data, and that as long as we show that we wrote a program to output instances of each value, we have met the intended goal of this week’s lesson.
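The output does say something about missing data: in the last table, the nan row has a normalized count of 0.103286, which over 213 rows works out to about 22 countries with no life expectancy value. A short sketch of how such a count could be produced directly; the five-value series is hypothetical.

```python
import pandas

# Hypothetical column with two missing entries (illustration only)
life = pandas.Series([80.7, None, 74.4, 57.9, None], name="lifeexpectancy")

missing = int(life.isna().sum())  # number of rows with no value
share = missing / len(life)       # fraction of rows with no value
print(missing, share)

# Reading the real output above: a NaN share of 0.103286 over 213 rows
countries_missing = round(0.103286 * 213)
print(countries_missing)
```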

nnikesh · 5 years ago

Week 1

Step 1: Identify your Data Set: Gapminder

Step 2: Identify your research question: is there a correlation between cell phone use, life expectancy and HDI?

Indicator name: Use of Cell Phone (per 100 people)
Definition of indicator: Cell phone users are people with access to cellular telephones
Source organization(s): World Bank
Link to source organization: http://www.worldbank.org/
Link to complete reference: http://data.worldbank.org/indicator/IT.CEL.SETS.P2

Indicator name: Internet Use (per 100 people)
Definition of indicator: Internet users are people with access to the Internet
Source organization(s): World Bank
Link to source organization: http://data.worldbank.org/indicator
Complete reference: World Development Indicators
Link to complete reference: http://data.worldbank.org/indicator/IT.NET.USER.P2

Indicator name: HDI
Definition of indicator: Human Development Index is an index used to rank countries by level of “human development”. It contains three dimensions: health level, educational level and living standard.
Source organization(s): UNDP
Link to source organization: http://hdr.undp.org/en/
Complete reference: UNDP Human Development Report
Link to complete reference: http://hdrstats.undp.org/en/indicators/103106.html

STEP 3. Prepare a codebook of your own: I add to my codebook variables reflecting access to cell, landline and internet communication.

STEP 4. Identify a second topic that you would like to explore in terms of its association with your original topic. I selected HDI, or Human Development Index (see http://hdr.undp.org/en/content/human-development-index-hdi for a full definition). This is an index that mashes together data about income, education, and life expectancy.

STEP 5. Add questions/items/variables documenting this second topic to your personal codebook. Is there a correlation between access to communication and HDI? I added Human Development Index (HDI) to codebook.

STEP 6. Perform a literature review to see what research has been previously done on this topic. Use sites such as Google Scholar (http://scholar.google.com) to search for published academic work in the area(s) of interest. Try to find multiple sources, and take note of basic bibliographic information.

Literature Review

Search strings:

access to communication and HDI, access to communication and development, cell communication and human development

1. Christian Fuchs, “Africa and the digital divide,” Telematics and Informatics, Volume 25, Issue 2, May 2008, Pages 99-116. Argues that the digital divide is a deeply structural, not just technical, problem.

2. Kay Raseroka, “Access to Information and Knowledge,” in Human Rights in the Global Information Society, edited by Rikke Frank Jørgensen.

3. Birdsall, Stephanie and William Birdsall, “Geography Matters: mapping technology access and human development,” http://ojphi.org/ojs/index.php/fm/article/view/1281/1201

4. Koroma, Joseph T., “Geography, Poverty, and Development Policy in the New African Millennium: Monitoring the Millennium Development Goals Through Human Development,” dissertation at Indiana State University, 2008, UMI # 3305416. KEY SOURCE: pages 94-95 explore the relationship between cell phone and land line penetration and HDI and find a 95% positive relationship.

Abstract: “The eight United Nations Millennium Development Goals (MDGs) are to eradicate extreme poverty and hunger, achieve universal primary education, promote gender equality and empower women, reduce child mortality, improve maternal health, combat HIV/AIDS, ensure environmental sustainability, and develop a global partnership for development. The eighth goal (MDG 8), aims to develop and strengthen a global partnership for development between rich and poor countries. Three main components of the latter (MDG 8) aim to accelerate the infusion of official development aid (ODA), liberalize international trade, and introduce information and communication technologies (ICT) in the hope of creating a conducive environment for the manifestation and realization of human and economic development in developing countries. Most of the MDGs have been hailed as quantifiable and time bound. Unfortunately, MDG 8, which is the ways and means goal, lacks quantifiable targets and comparable accountability. Further, the emphasis on development aid, technology, and international trade based on comparative advantage is similar to the old modernization perspective popular in 1950s and 1960s. Are the MDGs an inadvertent avenue by which to create dependency between the weak peripheral states of Africa and dominant, core countries, and their multilateral institutions? I perform regression analyses with Human Development Index (HDI), a proxy for the MDGs, as my dependent variable, and international trade, official development aid, and information and communication technologies as my independent variables in order to answer the research question.

STEP 7. Based on your literature review, develop a hypothesis about what you believe the association might be between these topics. Be sure to integrate the specific variables you selected into the hypothesis.

Hypothesis: access to communication influences HDI (Human Development Index), but is not in itself sufficient to ensure improved HDI. My hypothesis is that in some instances where there is access to information, HDI will not be as high as one might predict, perhaps because of extrinsic factors such as political violence.
